Summary:

New Columns:

  • handheld - (boolean) whether ticket was from a handheld or handwritten. Obtain from whether ticket_number is in 'SFMTA Data GPS 2015 - 2017'
  • gps_missing_record - (boolean) whether hh_gps was missing from the start
  • imputed - (boolean) was hh_gps imputed
  • address - (str) combined tick_street_no, tick_street_name, and tick_street_suffix
  • issue_datetime - (datetime) combined tick_issue_date and tick_issue_time
  • state_full - (str) the non-abbreviated name of the state. Abbreviation not part of the USA's 50 state are converted to 'Other'
  • violation_desc_general - (str) manually binned similar violation description together (i.e Invalid/Expired Permit == No Permit or Parking Prohibited == Parking Restricted)
  • violation_general - (str) manually binned similar violations together but broader than violation_desc_general (i.e Any violations related to meter like meter overtime, expired, outside line are Meter Related violations)

Imputation

  • tick_issue_time - imputation using the mode obtained by grouping by violation_desc_long (~3k imputed)
  • tick_street_no - corrected some that were clearly outliers because of an extra digit at the end
  • tick_street_name - manually edit some names which that they are most consistent with each other (i.e PIER 01 == PIER1 == PIER 1 THE EMBARCADERO)
  • tick_street_name - rename some suffix for consistency and remove uncommon ones
  • hh_gpslat & hh_gpslon - first imputed using the median gps value based on exact same address (i.e 400 MARKET == 400 MARKET != 400 MARKET ST), second using same method but based on the block of a street (i.e 400 MARKET to 499 MARKET converted to 400 MARKET), third is by using geocoding to fix the last bit. (1st level fixes ~400k, 2nd level ~900k, 3rd level ~40k)

Note I did not update the rest of the notebook after adding geopandas method in filtering out rows that were outside san franciscos. Because only ~301 rows were changed and they were points that won't impact the other methods and analyze by much or at all. 2/6/2018

In [1]:
import numpy as np
import pandas as pd
import os
import time
import pickle
import matplotlib.pyplot as plt
%matplotlib inline
import cartopy.crs as ccrs
import cartopy.feature as cfeat
from IPython.display import Image
import geocoder
import geopandas as gpd
from geopandas.tools import sjoin
import shapely
from shapely.geometry import Point
shapely.speedups.enable()
In [2]:
## View all columns
pd.set_option('max.columns', 30)
In [3]:
file = os.path.dirname('__filename__')
fp_df = os.path.join(file, 'dataset', 'SFMTA Data 2015 - 2017.zip')
fp_gps = os.path.join(file, 'dataset', 'SFMTA Data GPS 2015 - 2017.zip')
In [4]:
## Dates and time will be converted later
## low memory because ticket number is a mixture of integer and strings
df = pd.read_csv(fp_df, compression = 'zip',parse_dates= False, low_memory=True)
df_gps = pd.read_csv(fp_gps, compression='zip', parse_dates = False, low_memory=True)

## make columns name consistent and easier to use
df.columns = df.columns.str.lower().str.replace(' ', '_')
df_gps.columns = df_gps.columns.str.lower().str.replace(' ', '_')
In [5]:
print('Shape of df: {}'.format(df.shape))
print('Shape of df_gps: {}'.format(df_gps.shape))
Shape of df: (3923066, 25)
Shape of df_gps: (3785819, 5)
In [6]:
df.head(3)
Out[6]:
tick_issue_date issue_mnth_nbr issue_day_of_month issue_year issue_mnth issue_day tick_issue_hour tick_issue_time badge_# ticket_number ticket_type_code tick_vin plate_state plate veh_body violation_code violation violation_desc_long tick_street_no tick_street_name tick_street_suffix tick_rp_vin tick_corr_type tick_corr_type_desc_long tkt_cnt
0 2015/05/13 00:00:00 5 13 2015 MAY WED 15.0 15:50 286.0 854007081 P 0011 CA 5ZCL954 SW 377 TRC7.2.23B METER OUT DOWNTOWN 4420.0 MISSION ST 4S4BP61CX87310011 NaN NaN 1
1 2015/05/13 00:00:00 5 13 2015 MAY WED 13.0 13:26 395.0 855558480 P 1969 CA 8U22714 PK 375 TRC7.2.22 STREET CLEANING 834.0 47TH AVE 1GCEC14X98Z291969 NaN NaN 1
2 2015/05/13 00:00:00 5 13 2015 MAY WED 11.0 11:59 193.0 855017295 P TOO HIGH CA 8C50399 TK 179 V22500H DOUBLE PARKING 2284.0 MISSION ST RB - 6 NaN NaN 1
In [7]:
df_gps.head(3)
Out[7]:
tick_issue_date ticket_number hh_gpslat hh_gpslon tkt_cnt
0 2015/01/01 00:00:00 849682293 37.802057 -122.426795 1
1 2015/01/01 00:00:00 849682352 37.804563 -122.423215 1
2 2015/01/03 00:00:00 848917591 37.791143 -122.424453 1

Determine if gps coordinates are within San Francisco's boundaries

This part was added at the end so the ordering of the whole notebook is a little werid.

In [8]:
## From https://www.census.gov/geo/maps-data/data/tiger.html
## Contains boundaries for all counties in USA
tmp = gpd.GeoDataFrame.from_file('dataset/shapefile/cb_2016_us_county_500k')
## Obtain only San Francisco boundary
sf = tmp[((tmp.COUNTYFP == '075') & (tmp.STATEFP == '06'))].copy()
sf
Out[8]:
STATEFP COUNTYFP COUNTYNS AFFGEOID GEOID NAME LSAD ALAND AWATER geometry
2112 06 075 00277302 0500000US06075 06075 San Francisco 06 121455687 479136515 (POLYGON ((-123.013916 37.700355, -123.007786 ...
In [9]:
## There are 3 other smaller polygons. Part of San Francisco?
sf.plot()
plt.title('Boundary of San Francisco')
plt.show()
In [10]:
## gps that are zero are nan
df_gps.loc[df_gps.hh_gpslat == 0, 'hh_gpslat'] = np.nan
df_gps.loc[df_gps.hh_gpslon == 0, 'hh_gpslon'] = np.nan
In [11]:
## gps outside of bounding box is nan
# Latitude = 37.70, 37.85
outside = (df_gps.hh_gpslat < 37.7) | (df_gps.hh_gpslat > 37.85)
df_gps.loc[outside, 'hh_gpslat'] = np.nan
df_gps.loc[outside, 'hh_gpslon'] = np.nan
print(outside.sum())
# longitude = -122.54, -122.32
outside = (df_gps.hh_gpslon < -122.54) | (df_gps.hh_gpslon > -122.32)
df_gps.loc[outside, 'hh_gpslat'] = np.nan
df_gps.loc[outside, 'hh_gpslon'] = np.nan
print(outside.sum())
22
3
In [12]:
print('Total number of nan: {:,}'.format(df_gps.hh_gpslat.isna().sum()))
Total number of nan: 1,282,598
In [13]:
## Only classify rows with gps data
tmp = df_gps.loc[df_gps.hh_gpslat.notna()].copy()
tmp.reset_index(inplace=True)
## projection is just assumed
gps = gpd.GeoDataFrame(tmp[['index', 'ticket_number']], crs ={'init': 'epsg:4269'},
                 geometry = [Point(xy) for xy in zip(tmp.hh_gpslon, tmp.hh_gpslat)])
gps.head(2)
Out[13]:
index ticket_number geometry
0 0 849682293 POINT (-122.426795 37.802057)
1 1 849682352 POINT (-122.423215 37.804563)
In [14]:
## Determine if gps coordinates are inside polygon
inOut = sjoin(gps, sf[['NAME', 'geometry']], how = 'left')
print('Number of rows outside of SF: {}'.format(inOut.NAME.isna().sum()))
Number of rows outside of SF: 301
In [15]:
base = sf.plot(color = 'white', edgecolor = 'black', figsize=(30,5))
## Blue if inside
inOut[inOut.NAME.notna()].plot(ax=base, marker='o', color='blue',markersize=10)
## Red if outside
inOut[inOut.NAME.isna()].plot(ax=base, marker='o', color='red',markersize=10)
base.set_xlim(-122.54, -122.32)
base.set_ylim(37.70, 37.85)
plt.title('Points Inside vs Outside')
plt.show()
In [16]:
## set index
inOut.set_index('index', inplace=True)
In [17]:
## find points outside of SF
outside = inOut[inOut.NAME.isna()]
outside.head(3)
Out[17]:
ticket_number geometry index_right NAME
index
13874 854956255 POINT (-122.428615 37.70802800000001) NaN NaN
33119 860943952 POINT (-122.457067 37.708193) NaN NaN
33615 861482694 POINT (-122.429285 37.708177) NaN NaN
In [18]:
## Convert all points outside SF to nan
df_gps.iloc[outside.index, df_gps.columns.get_loc('hh_gpslat')] = np.nan
df_gps.iloc[outside.index, df_gps.columns.get_loc('hh_gpslon')] = np.nan
print('Total number of nan after: {:,}'.format(df_gps.hh_gpslat.isna().sum()))
Total number of nan after: 1,282,899
In [19]:
## plot gps points on sf boundaries
def sfplot(pt, markersize=10, imputed=False, limit=False):
    ## Exlcude rows that are missing gps
    pt = pt[pt.hh_gpslat.notna()]
    ## Extract only the imputed points
    if imputed:
        pt = pt[pt.imputed]
    ## Create a geodataframe for plotting
    gps = gpd.GeoDataFrame(pt[['ticket_number']], crs ={'init': 'epsg:4269'},
                 geometry = [Point(xy) for xy in zip(pt.hh_gpslon, pt.hh_gpslat)])
    ## the base map (SF)
    base = sf.plot(color='white', edgecolor='black', figsize=(30,5))
    gps.plot(ax=base, marker='o', color='red', markersize=10)
    
    ## Focus on SF only
    if limit:
        plt.xlim(-122.54, -122.32)
        plt.ylim(37.70, 37.85)
    ## show plot
    plt.show()
In [20]:
## Show that all points outside sf are removed
sfplot(df_gps)
In [21]:
## wrap up
del tmp, base, inOut, gps, outside

Ticket Type Code

In [22]:
df.ticket_type_code.unique()
Out[22]:
array(['P'], dtype=object)

This column only contains the letter 'P', so not a useful column

In [23]:
df.drop(columns = ['ticket_type_code'], inplace = True)

Ticket Number

In [24]:
## gps data is from hand held computer while the rest are hand written or other special citation
df_gps['handheld'] = True
In [25]:
## Ticket should all be unique
print('All ticket number for df is unique: {}'\
      .format(df.ticket_number.unique().shape[0] == df.shape[0]))
print('All ticket number for df_gps is unique: {}'\
      .format(df_gps.ticket_number.unique().shape[0] == df_gps.shape[0]))
print('df nan count: {}'.format(df.ticket_number.isna().sum()))
print('df_gps nan count: {}'.format(df_gps.ticket_number.isna().sum()))
All ticket number for df is unique: True
All ticket number for df_gps is unique: False
df nan count: 0
df_gps nan count: 0

Remove df_gps duplicates

In [26]:
tic_num_dup = df_gps.duplicated(subset = 'ticket_number', keep = False)
print('Number of duplicates for gps: {}'.format(tic_num_dup.sum()/2))
Number of duplicates for gps: 2.0
In [27]:
df_gps[tic_num_dup]
Out[27]:
tick_issue_date ticket_number hh_gpslat hh_gpslon tkt_cnt handheld
1388066 2017/02/21 00:00:00 874320112 NaN NaN 1 True
2451417 2017/02/21 00:00:00 874320112 37.760225 -122.46885 1 True
3117279 2015/12/30 00:00:00 863464114 37.739940 -122.46737 1 True
3354438 2015/12/30 00:00:00 863464114 NaN NaN 1 True
In [28]:
df[df.ticket_number.isin(['874320112', '863464114'])]
Out[28]:
tick_issue_date issue_mnth_nbr issue_day_of_month issue_year issue_mnth issue_day tick_issue_hour tick_issue_time badge_# ticket_number tick_vin plate_state plate veh_body violation_code violation violation_desc_long tick_street_no tick_street_name tick_street_suffix tick_rp_vin tick_corr_type tick_corr_type_desc_long tkt_cnt
869614 2017/02/21 00:00:00 2 21 2017 FEBRUARY TUE 18.0 18:05 116.0 874320112 CAN NOT READ CA 7JPJ273 4D 179 V22500H DOUBLE PARKING 740.0 KIRKHAM ST KNDJP3A52F7761656 4.0 DRIVE AWAY COPY 1
2158867 2015/12/30 00:00:00 12 30 2015 DECEMBER WED 15.0 15:51 291.0 863464114 6283 CA 4EBP015 VN 388 TRC7.2.35 PARKING ON GRADES 106.0 VICENTE ST 1B4GP45LXWB766283 NaN NaN 1

Manually verified on google map:

  • 874320112 = 740 KIRKHAM = 37.760225, -122.46885
  • 863464114 = 106 VICENTE = 37.739940, -122.46737

The locations are not perfectly at the same spot. Off by approx few feets.

In [29]:
## drop index that is duplicate and gps is nan
drop_idx = df_gps[tic_num_dup & df_gps.hh_gpslat.isna()].index.values
print('df_gps index to be dropped: {}'.format(drop_idx))
df_gps.drop(drop_idx, axis = 0, inplace = True)

print('Number of duplicates is zero: {}'\
      .format(df_gps.duplicated(subset = 'ticket_number', keep = False).sum() == 0))
df_gps index to be dropped: [1388066 3354438]
Number of duplicates is zero: True
In [30]:
del tic_num_dup, drop_idx ## wrap up

Why are df and df_gps ticket numbers' different

From Shawn: If there is no violation code you can disregard those violations. Often times it comes from handwritten citations not ones that start with 8. 8xxxxxx citation numbers are handheld computer generated and should have all the data.

In [31]:
## Determine why df ticket number is different from df_gps's ticket number
print("df's ticket_number dtype is: {}".format(df.ticket_number.dtype))
print("df_gps's ticket_number dytpe is: {}".format(df_gps.ticket_number.dtype))
df's ticket_number dtype is: object
df_gps's ticket_number dytpe is: int64
In [32]:
## The length of df's ticket_number
tic_num_len = df.ticket_number.str.len()
In [33]:
## GPS's ticket are only 9 char long
df_gps.ticket_number.astype('str').str.len().describe()[['min', 'max']].astype('int')
Out[33]:
min    9
max    9
Name: ticket_number, dtype: int32
In [34]:
## There is a range of main dataframe's ticket number
print('Number of tickets with length:')
for i in range(6, 12):
    print('{}: {}'.format(i, (tic_num_len == i).sum()))
Number of tickets with length:
6: 17
7: 2343
8: 1
9: 3786633
10: 134070
11: 2
In [35]:
## Observations:
## ticket_number start with 'J'
## badge_# is always nan
## ticket_type_code is always 'P
## tick_street_no is always nan
## issue_year is always 2017
df[tic_num_len == 6].head(2)
Out[35]:
tick_issue_date issue_mnth_nbr issue_day_of_month issue_year issue_mnth issue_day tick_issue_hour tick_issue_time badge_# ticket_number tick_vin plate_state plate veh_body violation_code violation violation_desc_long tick_street_no tick_street_name tick_street_suffix tick_rp_vin tick_corr_type tick_corr_type_desc_long tkt_cnt
977932 2017/10/18 00:00:00 10 18 2017 OCTOBER WED 16.0 16:18 NaN J48326 5612 SU 185 V22502B WRONG WAY PARKING NaN 20TH NaN SHSRD78515U305612 NaN NaN 1
978100 2017/10/18 00:00:00 10 18 2017 OCTOBER WED 16.0 16:26 NaN J48327 7836 NaN 188 V22505B SIGNS NaN 20TH NaN JT3FJ60GXH1137836 NaN NaN 1
In [36]:
## Observations:
## ~half badge_# is missing
## 179 tick_street_no is not null
## Inferred from 50 rows:
## ticket_number either start with 'H'/'D' or end with 'G'
## ticket_number with 'G' has violation that looks like GO1.I.6A
## violation with 'G' are related to parking lots??
# df[tic_num_len == 7].sample(n = 50, random_state = 142)
df[tic_num_len == 7].head(2)
Out[36]:
tick_issue_date issue_mnth_nbr issue_day_of_month issue_year issue_mnth issue_day tick_issue_hour tick_issue_time badge_# ticket_number tick_vin plate_state plate veh_body violation_code violation violation_desc_long tick_street_no tick_street_name tick_street_suffix tick_rp_vin tick_corr_type tick_corr_type_desc_long tkt_cnt
12028 2015/06/06 00:00:00 6 6 2015 JUNE SAT 16.0 16:25 NaN 000941G NOT GIVEN 4D 433 GO1.I.6A NO PERMIT OR EXPIRED NaN EMP 32 NaN JTDKN3DU2F0426564 50.0 DIS PERS PERMSU 1
12376 2015/06/08 00:00:00 6 8 2015 JUNE MON 5.0 05:10 NaN H911145 8280 4D 0 NO VIOL NO VIOLATION CODE NaN 3RD NaN KMHDH4AE7DU008280 NaN NaN 1
In [37]:
df.loc[tic_num_len == 7, 'issue_year'].value_counts()
Out[37]:
2017    1353
2015     526
2016     464
Name: issue_year, dtype: int64
In [38]:
## I think 002613GX has an extra X
## Because there are only 3 other ticket_number 
## with the letter X
df[tic_num_len == 8]
Out[38]:
tick_issue_date issue_mnth_nbr issue_day_of_month issue_year issue_mnth issue_day tick_issue_hour tick_issue_time badge_# ticket_number tick_vin plate_state plate veh_body violation_code violation violation_desc_long tick_street_no tick_street_name tick_street_suffix tick_rp_vin tick_corr_type tick_corr_type_desc_long tkt_cnt
2361976 2017/04/08 00:00:00 4 8 2017 APRIL SAT 18.0 18:45 6.0 002613GX NOT GIVEN SU 434 GO1.I.6B PROHIBITED/RESTRICTE NaN E LOT 2T3ZF4DV0BW104621 NaN NaN 1
In [39]:
## ticket_number with X is a mistake
## The 2 other is length 11
df[df.ticket_number.str.contains('x', case = False)]
Out[39]:
tick_issue_date issue_mnth_nbr issue_day_of_month issue_year issue_mnth issue_day tick_issue_hour tick_issue_time badge_# ticket_number tick_vin plate_state plate veh_body violation_code violation violation_desc_long tick_street_no tick_street_name tick_street_suffix tick_rp_vin tick_corr_type tick_corr_type_desc_long tkt_cnt
1381046 2017/04/08 00:00:00 4 8 2017 APRIL SAT 2.0 02:08 NaN PD29286014X 0473 NaN 393 TRC7.2.40 PRK PROHIBITED DT 550.0 BROADWAY NaN JTEZU14R68K010473 NaN NaN 1
2361976 2017/04/08 00:00:00 4 8 2017 APRIL SAT 18.0 18:45 6.0 002613GX NOT GIVEN SU 434 GO1.I.6B PROHIBITED/RESTRICTE NaN E LOT 2T3ZF4DV0BW104621 NaN NaN 1
3844412 2017/05/03 00:00:00 5 3 2017 MAY WED 10.0 10:06 32.0 1004255232X NOT GIVEN NaN 391 TRC7.2.38 PARKING IN STANDS 331.0 TOWNSEND NaN 3N1CN7AP8FL937504 NaN NaN 1
In [40]:
## Could not infer a pattern from 50 rows
## Example: 1003223701, PD27331441, 000220869U
df[tic_num_len == 10].head(2)
Out[40]:
tick_issue_date issue_mnth_nbr issue_day_of_month issue_year issue_mnth issue_day tick_issue_hour tick_issue_time badge_# ticket_number tick_vin plate_state plate veh_body violation_code violation violation_desc_long tick_street_no tick_street_name tick_street_suffix tick_rp_vin tick_corr_type tick_corr_type_desc_long tkt_cnt
28 2015/05/13 00:00:00 5 13 2015 MAY WED 16.0 16:49 NaN PD27331441 5523 NaN 0 NO VIOL NO VIOLATION CODE NaN 6TH ST NaN NaN NaN 1
55 2015/05/13 00:00:00 5 13 2015 MAY WED 9.0 09:20 NaN PD27717185 3841 NaN 340 V5200A TWO PLATES REQD F/R NaN 20TH NaN WDBWK54F86F093841 NaN NaN 1
In [41]:
df.loc[tic_num_len == 10, 'issue_year'].value_counts()
Out[41]:
2017    50632
2015    42914
2016    40524
Name: issue_year, dtype: int64

It looks like ticket_number with length 6, 7, or 10 is from different type of officers.

In [42]:
# ## find all those special ticket numbers
# special_tick = df.ticket_number[tic_num_len == 10].str.match('[^A-Z]+', na = False)
# print('Number: {}'.format(special_tick.sum()))

# df.loc[tic_num_len == 10, 'ticket_number'].str.match('[^A-Z]+', na = False)

# df[tic_num_len == 10][special_tick]['issue_year'].value_counts()

Edit to df's ticket_number

In [43]:
## remove x from ticket_number
df.ticket_number.replace('X', '', regex = True, inplace = True)
print('Ticket contain X: {}'.format(df.ticket_number.str.contains('X').any()))
Ticket contain X: False
In [44]:
del tic_num_len ## wrap up
In [45]:
## Correcting ticket number introduce duplicate ticket numbers
print('N dupluciates? {}'\
      .format(df.duplicated(subset='ticket_number', keep=False).sum()))

## Only one ticket 1004255232 doesn't look like an actual duplicate
df[df.duplicated(subset='ticket_number', keep=False)]
N dupluciates? 6
Out[45]:
tick_issue_date issue_mnth_nbr issue_day_of_month issue_year issue_mnth issue_day tick_issue_hour tick_issue_time badge_# ticket_number tick_vin plate_state plate veh_body violation_code violation violation_desc_long tick_street_no tick_street_name tick_street_suffix tick_rp_vin tick_corr_type tick_corr_type_desc_long tkt_cnt
1381046 2017/04/08 00:00:00 4 8 2017 APRIL SAT 2.0 02:08 NaN PD29286014 0473 NaN 393 TRC7.2.40 PRK PROHIBITED DT 550.0 BROADWAY NaN JTEZU14R68K010473 NaN NaN 1
1381047 2017/04/08 00:00:00 4 8 2017 APRIL SAT 2.0 02:08 NaN PD29286014 0473 NaN 393 TRC7.2.40 PRK PROHIBITED DT 550.0 BROADWAY NaN JTEZU14R68K010473 NaN NaN 1
2361976 2017/04/08 00:00:00 4 8 2017 APRIL SAT 18.0 18:45 6.0 002613G NOT GIVEN SU 434 GO1.I.6B PROHIBITED/RESTRICTE NaN E LOT 2T3ZF4DV0BW104621 NaN NaN 1
3833600 2017/04/08 00:00:00 4 8 2017 APRIL SAT 18.0 18:45 6.0 002613G NOT GIVEN SU 434 GO1.I.6B PROHIBITED/RESTRICTE NaN E LOT 2T3ZF4DV0BW104621 NaN NaN 1
3844412 2017/05/03 00:00:00 5 3 2017 MAY WED 10.0 10:06 32.0 1004255232 NOT GIVEN NaN 391 TRC7.2.38 PARKING IN STANDS 331.0 TOWNSEND NaN 3N1CN7AP8FL937504 NaN NaN 1
3851864 2017/05/19 00:00:00 5 19 2017 MAY FRI 16.0 16:20 36.0 1004255232 NOT GIVEN NaN 180 V22500I BUS ZONE 50.0 DRUMM NaN 3FADP0L37BR299904 4.0 DRIVE AWAY COPY 1
In [46]:
## Only 1 is not an actual duplicate but will just remove it
print('Before: {:,}'.format(df.shape[0]))
df.drop_duplicates(subset='ticket_number', keep='first', inplace=True)
print('After: {:,}'.format(df.shape[0]))
Before: 3,923,066
After: 3,923,063

Determine if Date Time components are redundant

issue_mnth_nbr, issue_day_of_month, issue_year, issue_mnth, issue_day, issue_day, tick_issue_hour might have been derived from tick_issue_date + tick_issue_time. This makes the information reduntant and wasting space. These features can be made on demand later.

In [47]:
## Are there any nan
print('Number of nan: {}'.format(df.tick_issue_date.isna().sum()))
## Is issue_date consistently 19 char long
print(df.tick_issue_date.str.len().describe()[['min', 'max']])
Number of nan: 0
min    19.0
max    19.0
Name: tick_issue_date, dtype: float64
In [48]:
## Split the date and time component out
issue_datetime = df.tick_issue_date.str.split(' ', expand = True)
issue_datetime.head(2)
Out[48]:
0 1
0 2015/05/13 00:00:00
1 2015/05/13 00:00:00
In [49]:
## Determine if the time component is just 00:00:00
## This means that the time part is just a place holder
issue_datetime[1].unique()
Out[49]:
array(['00:00:00'], dtype=object)
In [50]:
## Split date into year, month, day then convert to integer 
issue_date = issue_datetime[0].str.split('/', expand = True)
issue_date.columns = ['year', 'month', 'day']
issue_date = issue_date.apply(pd.to_numeric, axis = 0) ## convert to integer
issue_date.head(2)
Out[50]:
year month day
0 2015 5 13
1 2015 5 13
In [51]:
## Determine if year, month, and day are the same as the 
print("Issue Date's Year == issue_year: {}"\
      .format((issue_date.year == df.issue_year).all()))
print("Issue Date's Month == issue_mnth_nbr: {}"\
      .format((issue_date.month == df.issue_mnth_nbr).all()))
print("Issue Date's day == issue_day_of_month: {}"\
      .format((issue_date.day == df.issue_day_of_month).all()))
Issue Date's Year == issue_year: True
Issue Date's Month == issue_mnth_nbr: True
Issue Date's day == issue_day_of_month: True

Since year, month and days are the same so I can infer that issue_mnth and issue_day are the same as well.

Time

In [52]:
## Is df_gps time component meaningless?
## Yup, it is just zero
df_gps.tick_issue_date.str.split(' ', expand = True)[1].unique()
Out[52]:
array(['00:00:00'], dtype=object)
In [53]:
## Is df's tick_issue_hour redundant 
print('tick_issue_hour nan: {}'.format(df.tick_issue_hour.isna().sum()))
print('tick_issue_time nan: {}'.format(df.tick_issue_time.isna().sum()))
print('Same idx nan: {}'.format(((df.tick_issue_hour.isna() == df.tick_issue_time.isna()).all())))
tick_issue_hour nan: 3629
tick_issue_time nan: 3629
Same idx nan: True
In [54]:
## Extract hour from time. Convert to float
hour = pd.to_numeric(df.tick_issue_time.str.split(':', expand = True)[0])

hour_not_nan = ~df.tick_issue_time.isna()
print('All match: {}'\
      .format((df.tick_issue_hour[hour_not_nan] == hour[hour_not_nan]).all()))
All match: True

Yes, tick_issue_hour contains the same information as tick_issue_time

In [55]:
## Look into distribution of time 
hour = pd.to_numeric(df.tick_issue_time.replace(':', '', regex = True), errors='coerce')
print('Number of NaN should be 3629: {}'.format(hour.isna().sum()))
Number of NaN should be 3629: 3629
In [56]:
hour.plot.hist(bins = 20)
plt.show()
In [57]:
del hour_not_nan, issue_datetime, hour, issue_date ## wrap up

Remove useless date and time components

In [58]:
df.drop(['issue_mnth_nbr', 'issue_day_of_month', 
         'issue_year', 'issue_mnth', 
         'issue_day', 'tick_issue_hour',], 
         axis = 1, inplace = True)
In [59]:
## Remove 00:00:00 from date columns
df.tick_issue_date.replace(' 00:00:00', '', regex = True, inplace = True)
df_gps.tick_issue_date.replace(' 00:00:00', '', regex = True, inplace = True) 
In [60]:
df.head(1)
Out[60]:
tick_issue_date tick_issue_time badge_# ticket_number tick_vin plate_state plate veh_body violation_code violation violation_desc_long tick_street_no tick_street_name tick_street_suffix tick_rp_vin tick_corr_type tick_corr_type_desc_long tkt_cnt
0 2015/05/13 15:50 286.0 854007081 0011 CA 5ZCL954 SW 377 TRC7.2.23B METER OUT DOWNTOWN 4420.0 MISSION ST 4S4BP61CX87310011 NaN NaN 1
In [61]:
df_gps.head(1)
Out[61]:
tick_issue_date ticket_number hh_gpslat hh_gpslon tkt_cnt handheld
0 2015/01/01 849682293 37.802057 -122.426795 1 True

Merge df and df_gps

In [62]:
## Need both columns to be string to merge correctly
df_gps.ticket_number = df_gps.ticket_number.astype('str')
In [63]:
df_merge = df.merge(df_gps, how = 'outer', on = 'ticket_number', indicator = True)
df_merge.shape
Out[63]:
(3923063, 24)
In [64]:
print('Number of rows with')
for i in ['left_only', 'right_only', 'both']:
    print('{}: {}'.format(i, (df_merge._merge == i).sum()))
Number of rows with
left_only: 137246
right_only: 0
both: 3785817
In [65]:
## Looks like all gps rows match with df's rows
print('Rows difference between df and df_gps: {}'.format(df.shape[0] - df_gps.shape[0]))
Rows difference between df and df_gps: 137246
In [66]:
## non-handheld tickets should be false
df_merge.handheld.fillna(value = False, inplace = True)
In [67]:
df_merge.head(3)
Out[67]:
tick_issue_date_x tick_issue_time badge_# ticket_number tick_vin plate_state plate veh_body violation_code violation violation_desc_long tick_street_no tick_street_name tick_street_suffix tick_rp_vin tick_corr_type tick_corr_type_desc_long tkt_cnt_x tick_issue_date_y hh_gpslat hh_gpslon tkt_cnt_y handheld _merge
0 2015/05/13 15:50 286.0 854007081 0011 CA 5ZCL954 SW 377 TRC7.2.23B METER OUT DOWNTOWN 4420.0 MISSION ST 4S4BP61CX87310011 NaN NaN 1 2015/05/13 37.726747 -122.433322 1.0 True both
1 2015/05/13 13:26 395.0 855558480 1969 CA 8U22714 PK 375 TRC7.2.22 STREET CLEANING 834.0 47TH AVE 1GCEC14X98Z291969 NaN NaN 1 2015/05/13 37.772650 -122.507840 1.0 True both
2 2015/05/13 11:59 193.0 855017295 TOO HIGH CA 8C50399 TK 179 V22500H DOUBLE PARKING 2284.0 MISSION ST RB - 6 NaN NaN 1 2015/05/13 37.760448 -122.419285 1.0 True both
In [68]:
## Only interested in complete merge
both_idx = df_merge._merge == 'both'

Determine if gps issue date is the same as df

In [69]:
print('All tick_issue_date match: {}'\
      .format((df_merge.loc[both_idx, 'tick_issue_date_x'] \
               == df_merge.loc[both_idx, 'tick_issue_date_y']).all()))
All tick_issue_date match: True
In [70]:
df_merge.drop(columns=['tick_issue_date_y'], inplace = True)
df_merge.shape
Out[70]:
(3923063, 23)

Imputing Issue_time

In [71]:
print('Missing time are all from handwritten tickets: {}'\
      .format(df_merge.loc[df_merge.tick_issue_time.isna(), 'handheld'].sum() == 0))
Missing time are all from handwritten tickets: True
In [72]:
## Look into distribution of time
hour = pd.to_numeric(df_merge.tick_issue_time.replace(':', '', regex = True), errors='coerce')
In [73]:
hour[~df_merge.handheld].plot.hist(bins = 20)
plt.xlabel('24 Hour Format')
plt.title('Handwritten Tickets Time distribution')
plt.show()
In [74]:
## Extract only tickets that are handwritten
tmp = df_merge.loc[~df_merge.handheld, ['violation_desc_long']].copy()
tmp['hour'] = hour[~df_merge.handheld]
In [75]:
f, (ax1, ax2) = plt.subplots(1, 2, sharey = True, figsize = (12, 4))
tmp.loc[tmp.violation_desc_long == 'REGISTRATION TABS', 'hour']\
    .plot(ax = ax1, kind = 'hist', bins = 20)
ax1.axvline(tmp.loc[tmp.violation_desc_long == 'REGISTRATION TABS', 'hour'].mode()[0], c = 'red')
ax1.set_title('Handwrittent ticket for Registration Tabs')
ax1.set_xlabel('24HR Time')
tmp.loc[tmp.violation_desc_long == 'DOUBLE PARKING', 'hour']\
    .plot(ax = ax2, kind = 'hist', bins = 20)
ax2.axvline(tmp.loc[tmp.violation_desc_long == 'DOUBLE PARKING', 'hour'].mode()[0], c = 'red')
ax2.set_title('Handwrittent ticket for Double Parking')
plt.show()

There does look like different violations has different time trend.

In [76]:
## create a lookup table mode time group by violation
hour_impute = tmp.groupby('violation_desc_long')['hour']\
                 .apply(lambda x: x.mode())\
                 .unstack()[0].to_frame(name = 'mode_time')\
                 .reset_index()
hour_impute['mode_time'] = hour_impute.mode_time.astype('int').astype('str').str.pad(4, fillchar='0')

## Need to format time back to original format
hour_impute['hour'] = hour_impute.mode_time.str.slice(start=0, stop=2)
hour_impute['minute'] = hour_impute.mode_time.str.slice(start=2)
hour_impute['time'] = hour_impute.hour.str.cat(hour_impute.minute, sep=':')
hour_impute.head()

to_impute = df_merge.loc[df_merge.tick_issue_time.isna(), ['violation_desc_long']].reset_index()
print(to_impute.shape)

## impute missing with mode lookup table
impute = pd.merge(to_impute, hour_impute, how='left', on='violation_desc_long')
impute.set_index('index', inplace = True)
## FAILURE TO DISPLAY IS NaN so just we 12pm as a value
impute.time.fillna(value='12:00', inplace=True)
print(impute.shape)
print('Number of nan: {}'.format(impute.time.isna().sum()))
impute.head()
(3629, 2)
(3629, 5)
Number of nan: 0
Out[76]:
violation_desc_long mode_time hour minute time
index
577 REGISTRATION TABS 1000 10 00 10:00
1475 PUBLIC PROPERTY 1259 12 59 12:59
2068 PARKING RESTRICTIONS 1230 12 30 12:30
2410 WHEELCHAIR ACCESS 0640 06 40 06:40
4181 NO VIOLATION CODE 0640 06 40 06:40
In [77]:
# ## Using Xi's way
# tmp = pd.to_datetime(df_merge.tick_issue_date_x, errors='coerce')

# ## extract date information
# df_merge['year'] = tmp.dt.year
# df_merge['month'] = tmp.dt.month
# df_merge['dow'] = tmp.dt.dayofweek

# ## extract just the hour from the time column
# df_merge['hour'] = pd.to_numeric(df_merge.tick_issue_time.str.slice(start=0, stop=2), errors='coerce')

# ## Create lookup table for hour based on year, month, day of week, and violation_desc_long
# ## slightly different from Xi where he used violation_group is a column made later in the 
# ## notebook
# tix_time_lookup = df_merge.groupby(['year', 'month', 'dow', 'violation_desc_long'])['hour'].median().reset_index()
# tix_time_lookup.head(3)

# ## extract just the rows that needs be imputed
# to_impute = df_merge.loc[df_merge.tick_issue_time.isna(), ['violation_desc_long']].reset_index()
# print(to_impute.shape)

# ## impute missing with mode lookup table
# impute = pd.merge(to_impute, tix_time_lookup, how='left', on='violation_desc_long')
# print('Number of nan: {}'.format(impute.hour.isna().sum()))
# ## Imputing this way will result with ~1k nan so not good
# ## If I were to implement Xi's way, I will reorder basically my whole notebook,
# ## for ~3k imputation, I don't think it is worth the work and bugs.
In [78]:
## Integrate result back into dataframe
df_merge.iloc[impute.index, df_merge.columns.get_loc('tick_issue_time')] = impute['time']
In [79]:
## wrap up
del hour, tmp, to_impute, impute, f, ax1, ax2, hour_impute 
In [80]:
print('There are no missing rows in issue_time: {}'\
      .format(df_merge.tick_issue_time.isna().sum() == 0))
There are no missing rows in issue_time: True

Combining Date and Time

In [81]:
df_merge['issue_datetime'] = pd.to_datetime(df_merge['tick_issue_date_x'] \
                                            + ' ' + df_merge['tick_issue_time'],
                                            format = '%Y/%m/%d %H:%M')
print('Number of nan: {}'.format(df_merge.issue_datetime.isna().sum()))
Number of nan: 0
In [82]:
df_merge.tail().loc[:,['tick_issue_date_x', 'tick_issue_time', 'issue_datetime']]
Out[82]:
tick_issue_date_x tick_issue_time issue_datetime
3923058 2017/10/23 12:44 2017-10-23 12:44:00
3923059 2017/10/23 00:12 2017-10-23 00:12:00
3923060 2017/10/23 11:17 2017-10-23 11:17:00
3923061 2017/10/23 06:33 2017-10-23 06:33:00
3923062 2017/10/23 19:51 2017-10-23 19:51:00
In [83]:
df_merge.issue_datetime.value_counts().plot()
plt.show()
In [84]:
print('Before: {}'.format(df_merge.shape))
## Drop redundant columns
df_merge.drop(columns = ['tick_issue_date_x', 'tick_issue_time'], inplace = True)
print('After: {}'.format(df_merge.shape))
Before: (3923063, 24)
After: (3923063, 22)

Conclusion: issue_date looks fine.

Tkt cnt

In [85]:
print(df.tkt_cnt.value_counts())
print(df_gps.tkt_cnt.value_counts())
1    3923063
Name: tkt_cnt, dtype: int64
1    3785603
2        214
Name: tkt_cnt, dtype: int64

Only df_gps has tkt_cnts that are not 1.

In [86]:
## Find idx where tkt_cnt does not match
tkt_cnt_idx = df_merge['tkt_cnt_x'] != df_merge['tkt_cnt_y']
In [87]:
## Look at idx that doesn't have match tkt_cnt
df_merge[both_idx & tkt_cnt_idx].head(3)
Out[87]:
badge_# ticket_number tick_vin plate_state plate veh_body violation_code violation violation_desc_long tick_street_no tick_street_name tick_street_suffix tick_rp_vin tick_corr_type tick_corr_type_desc_long tkt_cnt_x hh_gpslat hh_gpslon tkt_cnt_y handheld _merge issue_datetime
75370 2.0 850464020 9507 CA 1437978 2D 420 TRC7.2.83 TRUCK LOAD ZONE 3904.0 24TH ST WMEEJ9AAXEK799507 NaN NaN 1 37.751577 -122.429975 2.0 True both 2015-01-14 10:18:00
115187 109.0 854050061 3482 CA 4VNS682 4D 195 V22514 FIRE HYDRANT 1246.0 MISSION ST 1FAFP55S72A134827 NaN NaN 1 37.776673 -122.414283 2.0 True both 2015-04-06 09:31:00
163141 309.0 861082294 3051 CA 5SNF616 4D 375 TRC7.2.22 STREET CLEANING 1965.0 GENEVA AVE 2B3KA43G56H293051 NaN NaN 1 37.710583 -122.425405 2.0 True both 2015-10-09 06:27:00
In [88]:
## Doesn't look like it is connected to date or time
df_merge.loc[both_idx & tkt_cnt_idx].duplicated(subset = 'issue_datetime', keep = False).sum()
Out[88]:
4
In [89]:
## Doesn't like a specific worker is giving this out
df_merge.loc[both_idx & tkt_cnt_idx, 'badge_#'].value_counts().plot.hist()
plt.title('')
plt.show()
In [90]:
## No conclusion
df_merge.loc[both_idx & tkt_cnt_idx,'tick_corr_type'].value_counts(dropna = False)
Out[90]:
NaN       163
 22.0      12
 51.0       5
 10.0       4
 802.0      4
 50.0       4
 59.0       4
 110.0      2
 31.0       2
 67.0       2
 19.0       2
 9.0        1
 111.0      1
 126.0      1
 11.0       1
 18.0       1
 43.0       1
 213.0      1
 40.0       1
 468.0      1
 14.0       1
Name: tick_corr_type, dtype: int64

I think the count is probably a mistake. This population ony account for ~0.006%. And tkt_cnt is just 1 for the main dataset. I don't believe these two columns are useful.

In [91]:
print(df_merge.shape)
df_merge.drop(columns=['tkt_cnt_x', 'tkt_cnt_y'], inplace = True)
print(df_merge.shape)
(3923063, 22)
(3923063, 20)
In [92]:
del both_idx, tkt_cnt_idx, df_gps ## wrap up

Conclusion: I believe tkt_cnt does not contain any useful information.

'Badge_#'

In [93]:
# df_merge['badge_#'].value_counts(dropna = False).to_csv('dataset/badge.csv')
## badge_number.csv
## badge number starts from 1 to 434. 
## There are unique ones like 1889, 2405, 9154, 9170, 9371, 9373
## There are gaps in between the numbering usually just one
## but once it hits 400 then the gaps are larger (4+)
In [94]:
print('NaN has: {}'.format(df_merge['badge_#'].isna().sum()))
df_merge['badge_#'].value_counts(dropna = True).plot.hist()
plt.title('Number of Tickets Per Badge Number')
plt.show()
NaN has: 85380
In [95]:
## change badge_# name and replace nan with -1
df_merge['badge'] = df_merge['badge_#'].copy()
In [96]:
df_merge.loc[df_merge['badge'].isna(), 'badge'] = -1
df_merge['badge'] = pd.to_numeric(df_merge['badge'], downcast='integer')
df_merge.badge.dtype
Out[96]:
dtype('int16')
In [97]:
print(df_merge.shape)
df_merge.drop(columns = ['badge_#'], inplace = True)
print(df_merge.shape)
(3923063, 21)
(3923063, 20)

Conclusion: The values looks fine, but I am sure what those few special badge number that are different from the rest.

plate_state

In [98]:
## Only the 50 states plus the captial
states = {"AL":"Alabama","AK":"Alaska","AZ":"Arizona","AR":"Arkansas",
          "CA":"California","CO":"Colorado","CT":"Connecticut","DE":"Delaware",
          "DC":"Washington DC","FL":"Florida",
          "GA":"Georgia","HI":"Hawaii","ID":"Idaho","IL":"Illinois","IN":"Indiana",
          "IA":"Iowa","KS":"Kansas","KY":"Kentucky","LA":"Louisiana","ME":"Maine",
          "MD":"Maryland","MA":"Massachusetts","MI":"Michigan",
          "MN":"Minnesota","MS":"Mississippi","MO":"Missouri","MT":"Montana","NE":"Nebraska",
          "NV":"Nevada","NH":"New Hampshire","NJ":"New Jersey","NM":"New Mexico","NY":"New York",
          "NC":"North Carolina","ND":"North Dakota","OH":"Ohio",
          "OK":"Oklahoma","OR":"Oregon","PA":"Pennsylvania",
          "RI":"Rhode Island","SC":"South Carolina","SD":"South Dakota","TN":"Tennessee",
          "TX":"Texas","UT":"Utah","VT":"Vermont","VA":"Virginia",
          "WA":"Washington","WV":"West Virginia","WI":"Wisconsin","WY":"Wyoming"}
states_df = pd.DataFrame.from_dict(states, orient='index')
states_df.reset_index(inplace = True)
states_df.columns = ['abbr', 'state_full']
states_df.head(3)
Out[98]:
abbr state_full
0 AL Alabama
1 AK Alaska
2 AZ Arizona

Merge State Abbr with Full name

In [99]:
df_merge = df_merge.merge(states_df, how = 'left', 
                          left_on = 'plate_state', right_on = 'abbr',
                          copy = False)
In [100]:
## merge snapshot
df_merge.loc[2:5, ['plate_state', 'abbr', 'state_full']]
Out[100]:
plate_state abbr state_full
2 CA CA California
3 NC NC North Carolina
4 CA CA California
5 CA CA California
In [101]:
## Create new value N/A
## This means state_plate was empty 
df_merge.loc[df_merge.plate_state.isin([' ', '|', np.nan]), 'state_full'] = 'N/A'
In [102]:
## Unmatch rows are converted to unknown
## There were values but unable to decode
df_merge.state_full.fillna(value = 'Unknown', inplace = True)
In [103]:
## Output csv file for easy viewing
df_merge.groupby(['plate_state', 'state_full'])['abbr']\
    .size().reset_index().to_csv(os.path.join('.','dataset','state_plate.csv'))
In [104]:
## Summary
print('Percentage per new group:')
for i in ['N/A', 'Unknown']:
    print('{}: {:.2f}%'.format(i, (df_merge.state_full == i).sum()/df_merge.shape[0]*100))
Percentage per new group:
N/A: 3.75%
Unknown: 0.10%
In [105]:
print(df_merge.shape)
## plate_state is no longer needed
df_merge.drop(columns = ['plate_state', 'abbr'], inplace = True)
print(df_merge.shape)
(3923063, 22)
(3923063, 20)
In [106]:
del states, states_df

Conclusion: Cannot really say much about cleaning this columning because there are only two letters to work with so it hard to guess wheather there was a typo because it could be the 1st or the 2nd letter which opens up to too many possible value. In addition, the actual population where value wasn't able to be decode is a very small proprotion.

Tick_vin

In [107]:
df_merge.tick_vin.head(10)
Out[107]:
0                 0011
1                 1969
2             TOO HIGH
3                 4374
4                 0556
5                 9986
6    1N4AA5AP8BC861802
7                 2166
8             TOO HIGH
9         CAN NOT READ
Name: tick_vin, dtype: object
In [108]:
vin_len = df_merge.tick_vin.str.len()
In [109]:
vin_len.value_counts(dropna=False, sort = False)
Out[109]:
 2.0           9
 3.0          92
 4.0     2609483
 5.0        6025
 6.0         927
 7.0      206797
 8.0      112030
 9.0       45381
 10.0         23
 11.0         24
 12.0     655016
 13.0        161
 14.0         29
 15.0         64
 16.0      88588
 17.0     196170
 18.0          3
 19.0          2
NaN         2239
Name: tick_vin, dtype: int64

Length 4 is the standard! So try to convert all to length 4 if possible.

In [110]:
## length 2
df_merge.loc[vin_len == 2, ['tick_vin', 'tick_rp_vin']]
Out[110]:
tick_vin tick_rp_vin
69246 20 JTDBE32K220103644
244320 44 1NXBR30EX6Z737344
705869 84 3VWRA29M6XM018464
1672702 59 JT8JS47E6R0086059
2103499 94 1FBAX2CG1GKA01262
2399139 64 1FTHE24H9THA49364
2962315 38 4T3ZF13C0WU016639
3274595 31 NaN
3453360 85 JT2BG22KXW0165585

Most of the last two number matches the tick_vin but there are mistakes like 705869 where tick_vin = 84 but the last 4 vin number is actually 8464.

In [111]:
## length 3
df_merge.loc[vin_len == 3, ['tick_vin', 'tick_rp_vin']].head(7)
Out[111]:
tick_vin tick_rp_vin
9317 252 JHLRD78516C031384
34985 661 1G4GB5G38FF316611
36729 417 YV1VW29562F806417
107529 034 JM1FB3325F0856034
148033 900 900
181324 COU 3GKFK16T01G190864
241200 886 5TDZA3EH4DS038886
In [112]:
## values that are length 3 and which one contains alphablets 
vin_tmp = df_merge.loc[vin_len == 3, 'tick_vin']
vin_tmp[vin_tmp.str.contains('\D+')].value_counts()
Out[112]:
COJ    1
VTW    1
AP4    1
COU    1
UTL    1
Name: tick_vin, dtype: int64
In [113]:
## Make sure that there are no descriptions in length 4 vins
vin_tmp = df_merge.loc[vin_len == 4, 'tick_vin'].str.contains('\D+')
print('Number of tick_vin length 4 with alphabets: {}'.format(vin_tmp.sum()))
df_merge.loc[vin_len == 4, ['tick_vin', 'tick_rp_vin', 'plate']][vin_tmp].head(10)
Number of tick_vin length 4 with alphabets: 651
Out[113]:
tick_vin tick_rp_vin plate
476 DVWY 1HGES165X1L015213 7FNS363
30289 632C 1J4FY19S3VP536320 HOPESF
30408 A912 1FMZU74W42ZA91206 4VAP934
58135 B004 5UXZV4C53D0B00432 TSURISS
69069 DVWY SMT395DJ0TJ035043 14G6029
76021 A778 NaN BFN1445
78889 DVWY JT6HF10UXX0093122 4HDT502
79162 2S10 NaN 869PEY
90242 459L ZAPM459L1A5701245 20N0083
91720 DVWY 1GKES16S946197501 5GZL643
In [114]:
## Length 4 and only alphabet
vin_tmp = df_merge.loc[vin_len == 4, ['tick_vin', 'tick_rp_vin']]
vin_tmp[vin_tmp.tick_vin.str.isalpha() & (vin_tmp.tick_vin != 'DVWY')].head(10)
Out[114]:
tick_vin tick_rp_vin
96251 DNDD NaN
148545 NNES NaN
245603 ZAPM ZAPM459L6F5100677
281135 XXXX NaN
290451 RLRC 4TAVL52N6TZ127043
352122 ZAPM ZAPM459L3B5702009
549124 QJKF WBAVB17596NK35917
642655 JKAK JKAKLEE12ADA30932
719925 WBAV WBAVB13506KX39716
793139 RRRC 4F2CZ49309KM04134

Those with alphabet is because they either misread like row 30289 or they just input a random part of the actual vin.

In [115]:
## Something interesting.
## Only badge 381 input the unique value DVWY
## But it is only a fraction of what this person does
print('Total count of DVWY: {}'\
      .format(df_merge.tick_vin.str.contains('DVWY').sum()))
print('Who input this: {}'\
      .format(df_merge.loc[df_merge.tick_vin == 'DVWY', 'badge'].unique()))
print('Total tickets by 381: {}'.format((df_merge.badge == 381).sum()))
Total count of DVWY: 74
Who input this: [381]
Total tickets by 381: 3473
In [116]:
## Length 5 
## Looks fine
df_merge.loc[vin_len == 5, ['tick_vin', 'tick_rp_vin']].head(10)
Out[116]:
tick_vin tick_rp_vin
1119 16221 JTMZD32V785116221
1598 45584 JM1BL1SG2A1145584
1668 04880 WVWMP7AN0CE504880
2098 47891 2T1BU4EE7AC347891
2137 83518 NaN
2208 33531 2T2BK1BA5AC033531
2280 06846 NaN
3286 11417 NaN
3388 11343 JTDKN3DUXA0111343
3519 67744 JHMGD38628S067744
In [117]:
## Length 5 with alphabet
vin_tmp = df_merge.loc[vin_len == 5, 'tick_vin'].str.contains('\D+')
df_merge.loc[vin_len == 5, ['tick_vin', 'tick_rp_vin']][vin_tmp].head(10)
Out[117]:
tick_vin tick_rp_vin
146687 E5200 E5200
342114 X66JO JT3AC11R5M0006630
382346 X2031 1G8ZH54822Z112031
438005 FERNA NaN
438669 5204A 36654363
591672 C9907 1FTRW12W37KC99074
863322 X6029 1FMCU0JX5GUB16029
872075 X3060 1HGCD5635VA183060
872078 D1725 D1725
911867 X8950 NaN
In [118]:
## Length 5 with only alphabet
## There is only one
vin_tmp = df_merge.loc[vin_len == 5, ['tick_vin', 'tick_rp_vin']]
vin_tmp.loc[vin_tmp.tick_vin.str.isalpha(), 'tick_vin'].value_counts()
Out[118]:
FERNA    1
Name: tick_vin, dtype: int64
In [119]:
del vin_tmp, vin_len ## wrap up

Conclusion: tick_vin that was input by human is very unreliable. And there is no way to fix it. The common errors are decription in column like CAN NOT READ or TOO HIGH, random locations of the vin to record (sometime the beginning of the vin which is the same for many cars), random length to record, typo, misread label, or just random input just to fill in the space.

Tick_rp_vin

In [120]:
df_merge.tick_rp_vin.isna().sum()
Out[120]:
371712
In [121]:
## vin character length
rp_vin_len = df_merge.tick_rp_vin.str.len()
In [122]:
rp_vin_len.value_counts(dropna = False)
Out[122]:
 17.0    3397337
NaN       371712
 8.0      105128
 4.0        9576
 6.0        9527
 11.0       5004
 13.0       4432
 5.0        4160
 2.0        3263
 7.0        3109
 3.0        2578
 10.0       2487
 9.0        1336
 12.0       1313
 14.0        916
 1.0         798
 18.0        144
 16.0        101
 15.0         94
 19.0         47
 20.0          1
Name: tick_rp_vin, dtype: int64
In [123]:
(rp_vin_len != 17).sum()/df_merge.shape[0]*100
Out[123]:
13.400906383608929
In [124]:
del rp_vin_len ## wrap up

Conclusion: Have not digged too deeply into this feature.

Plate

In [125]:
df_merge.plate.str.len().value_counts(dropna = False)
Out[125]:
7    3517031
1     260932
6     127013
5      11731
4       4171
3       1008
8        950
2        227
Name: plate, dtype: int64
In [126]:
## Get the length of license plate input
plate_len = df_merge.plate.str.len()
In [127]:
## Count the unique values with length 1
df_merge.loc[plate_len == 1, 'plate'].value_counts().to_frame().transpose()
Out[127]:
0 1 2 5 3 4 C 7 T 6 A H 9 M U 8 F Y B X W Z N & G K D L
plate 260655 78 72 19 15 15 10 9 7 6 6 5 4 4 3 3 3 3 2 2 2 2 1 1 1 1 1 1 1
In [128]:
## Length 6 is also standard for some states. 
## California and a few other have length 7
In [129]:
## Length 5 looks fine. inferred from 50 rows
## Some are custom plates
# df_merge[plate_len == 5].head(50)
df_merge.loc[plate_len == 5, 'plate'].sample(n = 10, random_state = 142)
Out[129]:
2752843    15191
3330239    AGLEN
1132427    66586
379452     RAFTS
746356     PIPX0
2799475    7M007
3838730    20295
510152     S0LLY
626381     33988
1914054    63277
Name: plate, dtype: object
In [130]:
## Length 4 looks fine
## Custom plate probably
# df_merge[plate_len == 4].sample(n = 50, random_state = 142)
df_merge.loc[plate_len == 4, 'plate'].sample(n = 10, random_state = 142)
Out[130]:
1093436    IMHE
1041904    GSGH
1985786    WFGD
3785402    COWP
1983391    DWEU
3450521    8190
1504529    JH33
2108771    SLOW
2421464    ETEA
638528     7VHC
Name: plate, dtype: object
In [131]:
## Length 3 looks fine
## Custom plate probably
# df_merge[plate_len == 3].sample(n = 50, random_state = 142)
df_merge.loc[plate_len == 3, 'plate'].sample(n = 10, random_state = 142)
Out[131]:
1965956    BTN
2868230    HR1
3397945    797
1206834    AJU
1110143    509
2657884    509
3164381    ETS
1235295    VKG
3259642    CCI
3525371    OIC
Name: plate, dtype: object
In [132]:
## Length 8 
## I don't know if they are special plates 
## or fake plates like ILVESHOE or HMMRTYME
# df_merge[plate_len == 8].sample(n = 50, random_state = 142)
df_merge.loc[plate_len == 8, 'plate'].sample(n = 10, random_state = 142)
Out[132]:
3102974    22622002
2324480    21369781
3767865    TK622MPQ
1405317    ILVESHOE
867359     DLR85259
2940050    68106333
2956327    5BEXX882
336277     47011620
677864     7AKUU196
3692835    7RNNX238
Name: plate, dtype: object
In [133]:
## Length 2
## These look like random inputs. And I don't think license plate
## Can be two letter long. 
# df_merge[plate_len == 2].sample(n = 50, random_state = 142)
df_merge.loc[plate_len == 2, 'plate'].sample(n = 10, random_state = 142)
Out[133]:
1196299    18
1803370    03
3734645    05
1359704    GB
174305     34
2118587    02
2947507    41
22817      4F
3397537    SK
1331423    9Z
Name: plate, dtype: object
In [134]:
print('Number of nan before: {}'.format(df_merge.plate.isna().sum()))
## Convert length 1 and 2 to N/A or something like that
df_merge.loc[plate_len <= 2, 'plate'] = np.nan
print('Number of nan after: {}'.format(df_merge.plate.isna().sum()))
Number of nan before: 0
Number of nan after: 261159
In [135]:
del plate_len ## wrap up

Conclusion: It is hard to tell whether a plate is valid because states can have different license plate standard and sometime their format are changed later. In addition, custom plates are allowed. Also it is possible that some plates are fake but cannot be easily tell.

Veh_body

In [136]:
df_merge.veh_body.isna().sum()
Out[136]:
25348
In [137]:
df_merge.veh_body.value_counts(dropna = False).to_csv('dataset/veh_body.csv')
In [138]:
print(df_merge.veh_body.unique())
['SW' 'PK' 'TK' '2D' '4D' 'SU' nan 'CV' 'MS' 'TL' 'VN' 'MC' 'TN' 'PV' 'MH'
 'MP' 'BU' 'LM' 'SB' 'PU' 'HS' 'FB' 'VA' 'UT' '4W' 'SV' 'SD' 'BT' 'TW' 'HR'
 'TR' 'CE' 'WH' 'FR' '4H' 'SC']

Conclusion: I tried decoding the vehicle body style code but did not have luck. There are minor detail between the different table. In addition, I was not able to find what 'SU' stand for which is important because it is top 3 (amount = 378722) in the value_count table. As a result, I have to leave it has is.

Violation, Violation code, Violation desc long

In [139]:
print('Unique Values and Number of NaN for each:')
for i in ['violation', 'violation_code', 'violation_desc_long']:
    print('{}: Unique: {} NaN: {}'.format(i, 
                                          df_merge[i].unique().shape[0], 
                                          df_merge[i].isna().sum()))
Unique Values and Number of NaN for each:
violation: Unique: 186 NaN: 0
violation_code: Unique: 186 NaN: 0
violation_desc_long: Unique: 172 NaN: 0
In [140]:
## Created violation_group_lookup.xlsx from this function
# df_merge.groupby(['violation_code', 'violation', 'violation_desc_long'])\
#     .size().reset_index(name = 'counts').to_csv('dataset/violation_group.csv')
## Afterward, manually bin similar violation together like Prohbited Parking == Parking Restricted
## Or Parking Intersection == Block/Intersection == Park/Veh Crossing

## 3/5/2018
## Shawn redid the lookup table such that it better represent SFMTA's view 
## had to resaved shawn's excel file because it had to be repaired

Merge Lookup Table

In [141]:
violation_lookup = pd.read_excel(os.path.join('.','dataset','violation_group_lookup-sm_1.xlsx'), 0)
In [142]:
print(violation_lookup.shape)
violation_lookup.head(3)
(186, 6)
Out[142]:
violation_code violation violation_desc_long violation_desc_general violation_group counts
0 0 NO VIOL NO VIOLATION CODE No Violation code No Violation 2591
1 23 H22 PRK PROHIB Parking Restricted No Parking 116
2 29 H24B RED/WHITE ZONE Red Zone Safety 15
In [143]:
print('Before: {}'.format(df_merge.shape))
## Merge df_merge with violation_lookup
df_merge = df_merge.merge(violation_lookup[['violation_code', 'violation_desc_general', 'violation_group']], 
                          on = 'violation_code', how = 'left')
print('After: {}'.format(df_merge.shape))
Before: (3923063, 20)
After: (3923063, 22)
In [144]:
df_merge[['ticket_number', 'violation_code', 'violation', 
          'violation_desc_long', 'violation_desc_general', 'violation_group']]\
        .sample(n = 10, random_state = 142)
Out[144]:
ticket_number violation_code violation violation_desc_long violation_desc_general violation_group
2481743 857954112 374 TRC7.2.20 RESIDENTIAL OVERTIME Overtime Parking - Residential Overtime
634614 890933772 375 TRC7.2.22 STREET CLEANING Street Cleaning Street Cleaning
581245 848013165 375 TRC7.2.22 STREET CLEANING Street Cleaning Street Cleaning
789137 872295815 375 TRC7.2.22 STREET CLEANING Street Cleaning Street Cleaning
270492 869634021 375 TRC7.2.22 STREET CLEANING Street Cleaning Street Cleaning
3117089 861838246 375 TRC7.2.22 STREET CLEANING Street Cleaning Street Cleaning
1547351 850585842 377 TRC7.2.23B METER OUT DOWNTOWN Meter Parking - Expired Out DownTown Meter Related
446344 884288064 374 TRC7.2.20 RESIDENTIAL OVERTIME Overtime Parking - Residential Overtime
2571010 854041856 375 TRC7.2.22 STREET CLEANING Street Cleaning Street Cleaning
455115 885067853 375 TRC7.2.22 STREET CLEANING Street Cleaning Street Cleaning
In [145]:
del violation_lookup ## wrap up

Conclusion: I looked at the description of the violation and combined similar violation together.

For instance:

  • Park Prohibited and Towaway is the same as Park Restricted
  • Invalid/Expired Permit is the same as No Permit
  • Large Vehicle is the same as Commerical Vehicle
  • Parking blue zone, blocking blue zone, parking in cross hatch is the same
  • Saftey Zone is Red Zone

tick_corr_type, tick_corr_type_desc_long

These column was done after violation, violation code, violation desc long. Not sure what impact that has.

In [146]:
print('Percent of non-null: {:0.2f}%'\
      .format(df_merge.tick_corr_type_desc_long.notna().sum()/df_merge.shape[0]*100))
Percent of non-null: 10.65%
In [147]:
## Count tickets with correction groupby violation_group, violation_desc_long and tick_corr_type_desc_long
## sort by counts in violation_groups. 
## To which group has the most correction and if there are obvious trend. 
tmp1 = df_merge[df_merge.tick_corr_type.notna()].groupby(['violation_group', 'violation_desc_long', 'tick_corr_type_desc_long'])\
                .size().reset_index(name = 'counts')
tmp2 = df_merge[df_merge.tick_corr_type.notna()].groupby('violation_group').size().reset_index(name = 'group_count')
tmp3 = pd.merge(tmp1, tmp2, on = 'violation_group').sort_values(by = ['group_count', 'counts'], ascending = False)
tmp3.to_csv('dataset/tick_corr.csv')
tmp3.head()
Out[147]:
violation_group violation_desc_long tick_corr_type_desc_long counts group_count
2361 Street Cleaning STREET CLEANING STR CLN PROT DN 18043 85487
2310 Street Cleaning STREET CLEANING DIS PERS PERMSU 9883 85487
2301 Street Cleaning STREET CLEANING 802REFNOTF 8417 85487
2356 Street Cleaning STREET CLEANING STILL OWING 8176 85487
2354 Street Cleaning STREET CLEANING SIGN POSTED VAL 5609 85487
In [148]:
del tmp1, tmp2, tmp3 ## wrap up

Conclusion: tick_corr is more of administrative work such as determine if there are other complications to the citations like whether the car was stolen, is tick still owing, the car drove away before the officer was able to place the notice. In some cases, it doesn't haven anything to do with the violation therefore reducing the available information (BLOCK/INTERSECTION --> DRIVE AWAY COPY). And in some case just redunant information (PARKING ON GRADES == 58A CURBED WHLS). Not only that, most description are hard to understand because they are abbreviated (GEN VALID CITE, 802REFNOTF, SRVY CNDCT-VAL)

I will leave this feature as is.

hh_gpslat hh_gpslon

This part was not updated after adding the method to filter out points outside of san francisco

In [149]:
with pd.option_context('display.float_format', '{:0.6f}'.format):
    print(df_merge[['hh_gpslat', 'hh_gpslon']].describe())
           hh_gpslat      hh_gpslon
count 2502920.000000 2502920.000000
mean       37.770906    -122.431263
std         0.023551       0.029444
min        37.708147    -122.513813
25%        37.757813    -122.449192
50%        37.776722    -122.425090
75%        37.788275    -122.408058
max        37.810078    -122.367615
In [150]:
## Determine if lat and long are missing at the same time
print('GPS are missing together: {}'\
      .format((df_merge.hh_gpslat.isna() == df_merge.hh_gpslon.isna()).all()))
print('GPS are value 0 together: {}'\
      .format(((df_merge.hh_gpslat == 0) == (df_merge.hh_gpslon == 0)).all()))
GPS are missing together: True
GPS are value 0 together: True
In [151]:
print('Nan total: {:,}'.format(df_merge.hh_gpslat.isna().sum()))
Nan total: 1,420,143
In [152]:
# print('Nan before: {}'.format(df_merge.hh_gpslat.isna().sum()))
# ## Convert all gps with value 0 to nan
# df_merge.loc[df_merge.hh_gpslat == 0, 'hh_gpslat'] = np.nan
# df_merge.loc[df_merge.hh_gpslon == 0, 'hh_gpslon'] = np.nan
# print('Nan after: {}'.format(df_merge.hh_gpslat.isna().sum()))
In [153]:
## Keep record for which row was missing gps coordinates
df_merge['gps_missing_record'] = df_merge.hh_gpslat.isna()
## gps_missing is later
df_merge['gps_missing'] = df_merge.hh_gpslat.isna()
## indicator for which rows was imputed 
df_merge['imputed'] = False
In [154]:
## Percent missing
print('Percent gps coord missing: {:.2f}%'\
      .format(df_merge.hh_gpslat.isna().sum()/df_merge.shape[0]*100))
Percent gps coord missing: 36.20%
In [155]:
## Percent missing for handheld issued tickets
print('Percent gps coord missing for handheld: {:0.2f}%'\
      .format(df_merge[df_merge.handheld].hh_gpslat.isna().sum()/df_merge[df_merge.handheld].shape[0]*100))
Percent gps coord missing for handheld: 33.89%

There are some random stray that are in the ocean or one in Salt lake City. However, they are limited so I am not too worry about them. (UPDATE: after using a bounding box and geopandas to determine if points are inside the boundaries. These outliers are removed. As can be seen in above)

Rough estimates to how accurate gps coordinates are for a single street

In [156]:
## Export df_merge for tableau visualization
## Only columns needed
# df_merge.loc[~df_merge.gps_missing, 
#              ['ticket_number', 'issue_datetime', 
#               'hh_gpslat', 'hh_gpslon', 'tick_street_name']]\
#             .to_csv('dataset/tableau/df_merge_min.csv')
In [157]:
## Manually selected selected points that were approximately one block outside of Geary
Image(filename = os.path.join('.','image','geary_all_year_oneblock_filter.png'),
      width = 500, height = 100)
Out[157]:
In [158]:
geary_out = pd.read_csv(os.path.join('.','dataset','tableau', 'geary_outside_oneblock.csv'))
print('Number of rows outside of Geary: {}'.format(geary_out.shape[0]))
## Out of all tickets with 'Geary' street_name what is the percent bad 
print('Percent of ticket population: {:0.3f}%'\
      .format(geary_out.shape[0]/
              df_merge.tick_street_name.str.contains('GEARY',regex=False).sum()*100))
Number of rows outside of Geary: 342
Percent of ticket population: 0.552%
In [159]:
## Obtain the tickets using index from geary_out
df_merge.iloc[geary_out.F1][['ticket_number','tick_street_no', 
                             'tick_street_name', 'tick_street_suffix', 
                             'hh_gpslat', 'hh_gpslon']]\
                            .sample(n = 5, random_state = 33)
Out[159]:
ticket_number tick_street_no tick_street_name tick_street_suffix hh_gpslat hh_gpslon
2232830 870145566 2198.0 SUTTER ST 37.785707 -122.436478
2549946 851489984 505.0 MISSION ST 37.789585 -122.397912
1330190 875765715 825.0 GEARY ST 37.783280 -122.432660
209033 864812686 442.0 GEARY ST 37.765145 -122.396020
2072743 853403471 2144.0 UNION ST NaN NaN

Correct gps coordinates:

  • #870067656 4800 GEARY BLVD
    • wrong: 358 Funston Ave 37.781433, -122.471718
    • correct: 37.780801, -122.470856
    • It isn't off by much, by half a block.
  • #852347086 377 GEARY ST
    • wrong: 1122 Post Street
    • correct: 37.787089, -122.409418
    • It is ~0.8 miles driving distant off
  • #875765715 825 GEARY ST
    • wrong: 1480 Fillmore St
    • correct: 37.786163, -122.416954
    • It is ~1.1 miles driving distant off
In [160]:
## Manually selected selected points that were approximately one block outside of Mission
Image(filename = os.path.join('.','image','mission_all_year_oneblock.png'),
      width = 500, height = 100)
Out[160]:
In [161]:
mission_out = pd.read_csv(os.path.join('.','dataset','tableau', 'mission_outside_oneblock.csv'))
print('Number of rows outside of Mission: {}'.format(mission_out.shape[0]))
## Out of all tickets with 'Mission' street_name what is the percent bad 
print('Percent of ticket population: {:0.2f}%'\
      .format(mission_out.shape[0]/
              df_merge.tick_street_name.str.contains('MISSION',regex=False).sum()*100))
Number of rows outside of Mission: 5041
Percent of ticket population: 4.23%
In [162]:
## Obtain the tickets using index from mission_out
df_merge.iloc[mission_out.F1][['ticket_number','tick_street_no', 
                             'tick_street_name', 'tick_street_suffix', 
                             'hh_gpslat', 'hh_gpslon']]\
                            .sample(n = 5, random_state = 33)
Out[162]:
ticket_number tick_street_no tick_street_name tick_street_suffix hh_gpslat hh_gpslon
199489 863739240 2267.0 MISSION ST 37.774025 -122.404630
745454 868345354 974.0 MISSION ST 37.774298 -122.404357
1200292 865517251 3041.0 MISSION ST 37.773963 -122.404488
3102175 891363535 2412.0 34TH AVE 37.741982 -122.491743
3516913 851430451 1704.0 MONTGOMERY ST NaN NaN

Correct GPS:

  • #863739240 2267 MISSION ST
    • wrong: 548 7th St 37.774025, -122.404630
    • correct: 37.760831, -122.419124
    • It is ~1.6 miles driving distant off
  • #868345354 974 MISSION ST
    • wrong: 507 7th st 37.774298, -122.404357
    • correct: 37.781411, -122.408152
    • It is ~0.7 miles driving distant off
  • #865517251 3041 MISSION ST
    • wrong: 548 7th st 37.773963, -122.404488
    • correct: 37.748540, -122.418101
    • It is ~2.6 miles driving distant off

Rough Estimates on single address (handheld only)

In [163]:
## Only look for address from handheld tickets because 
## handwritten tickets has no gps therefore it will be bias 
tmp = df_merge[df_merge.handheld]
## Find the most common address
full_street = tmp.tick_street_no.astype('str')\
    .str.cat([tmp.tick_street_name,
              tmp.tick_street_suffix], sep = ' ')
## Remove decimal due to float conversion
full_street = full_street.str.replace('\.0', '')
full_street.head()
Out[163]:
0    4420 MISSION ST
1       834 47TH AVE
2    2284 MISSION ST
3        568 02ND ST
4     22 CHABOT TERR
Name: tick_street_no, dtype: object
In [164]:
## Convert to dataframe
tmp = pd.DataFrame(full_street)
## Change name of column
tmp.columns = ['full_street']
## Add gps to determine which row is missing
tmp['gps'] = df_merge[df_merge.handheld].hh_gpslat
## Aggregate based on full_street and return total rows and
## number of nonmissing rows
tmp = tmp.groupby('full_street')['gps'].agg(['count', 'size'])\
            .sort_values(by='count', ascending = False)
tmp.columns = ['hasgps', 'total'] ## remove columns
## Calculate percent of not missing gps data
tmp['percent_complete'] = tmp['hasgps']*100//tmp['total']
tmp.to_csv('dataset/full_street_counts.csv')
tmp.head(6)
Out[164]:
hasgps total percent_complete
full_street
400 01ST ST 3193 8845 36
300 01ST ST 2449 2939 83
700 04TH ST 1564 3445 45
501 TERRY A FRANCOIS BLVD 1116 1323 84
100 01ST ST 998 1315 75
331 TOWNSEND ST 858 1381 62
In [165]:
tmp.percent_complete.plot.hist()
plt.title('Distribution of Complete GPS Data By Address')
plt.xlabel('Completeness (%)')
plt.show()
In [166]:
## If you didn't remove non-handheld tickets, you will 
## see something entirely different for 700 4TH ST 
## the completeness is 0%
tmp.loc[['700 04TH ST', '700 4TH ST']]
Out[166]:
hasgps total percent_complete
full_street
700 04TH ST 1564 3445 45
700 4TH ST 7 8 87
In [167]:
## gps from google maps search
## Looking at some of the top address
## There are two 4th st. Just seeing if that zero affects
## the geoencoding
full_street_gps = pd.DataFrame.from_dict({'full_street': ['400 01ST ST', '300 01ST ST',
                                                          '700 04TH ST',
                                                          '501 TERRY A FRANCOIS BLVD', 
                                                          '331 TOWNSEND ST'],
                                          'hh_gpslat': [37.786329, 37.787343,
                                                  37.776789, 
                                                  37.775569, 37.776578],
                                          'hh_gpslon': [-122.393263, -122.394522,
                                                  -122.394644, 
                                                  -122.387663, -122.395667], 
                                          'gps_from': 'google'})
## Repeat df by 3 times to include years
full_street_gps = pd.concat([full_street_gps]*3, ignore_index=True)
## Add year from 2015 - 2017
full_street_gps['year'] = pd.Series([2015,2016,2017]).repeat(full_street_gps.shape[0]/3).values
print(full_street_gps.shape)
full_street_gps.head()
(15, 5)
Out[167]:
full_street gps_from hh_gpslat hh_gpslon year
0 400 01ST ST google 37.786329 -122.393263 2015
1 300 01ST ST google 37.787343 -122.394522 2015
2 700 04TH ST google 37.776789 -122.394644 2015
3 501 TERRY A FRANCOIS BLVD google 37.775569 -122.387663 2015
4 331 TOWNSEND ST google 37.776578 -122.395667 2015
In [168]:
## Only data that belongs to handheld
## And extract only columns of interest
tmp_handheld = df_merge.loc[df_merge.handheld, ['ticket_number', 'issue_datetime',
                                                'tick_street_name','hh_gpslat', 
                                                'hh_gpslon']].copy() 
tmp_handheld['full_street'] = full_street ## Add full_street address
## get a subset. export to tableau. make sure to have year as well. 
## Extract tickets with address of interest
tmp_handheld = tmp_handheld[tmp_handheld.full_street.isin(full_street_gps.full_street)]
tmp_handheld.dropna(inplace = True) ## drop missing gps
## extract year from datetime
tmp_handheld['year'] = tmp_handheld.issue_datetime.dt.year
tmp_handheld.drop(columns=['issue_datetime'], inplace = True)
tmp_handheld['gps_from'] = 'ticket'
print(tmp_handheld.shape)
tmp_handheld.head(2)
(9180, 7)
Out[168]:
ticket_number tick_street_name hh_gpslat hh_gpslon full_street year gps_from
733 855034143 01ST 37.78739 -122.394688 300 01ST ST 2015 ticket
934 855034165 01ST 37.78745 -122.394487 300 01ST ST 2015 ticket
In [169]:
tmp_handheld.groupby(['year', 'full_street']).size().unstack(level = 0).plot.bar()
plt.ylabel('Number of tickets')
plt.title('Tickets by Year and Address with GPS')
plt.show()
In [170]:
## Export df for tableau
tmp_handheld.append(full_street_gps, ignore_index=True)\
    .to_csv('./dataset/tableau/tmp_handheld.csv', index = False)
In [171]:
Image(filename = os.path.join('.','image','1_400_01ST.png'),
      width = 400, height = 80)
Out[171]:
In [172]:
Image(filename = os.path.join('.','image','2_300_01ST.png'),
      width = 400, height = 80)
Out[172]:
In [173]:
Image(filename = os.path.join('.','image','3_700_4TH.png'),
      width = 400, height = 80)
Out[173]:
In [174]:
Image(filename = os.path.join('.','image','4_510_Terry.png'),
      width = 400, height = 80)
Out[174]:
In [175]:
Image(filename = os.path.join('.','image','5_331_Townsend.png'),
      width = 400, height = 80)
Out[175]:

Certain location has very high variance while other doesn't like 4TH and Terry. It is unlikely for the gps coordinates to be based on user input because their input are exactly the same. This is based on uncleaned data too.

In [176]:
del geary_out, mission_out, full_street, full_street_gps, tmp_handheld

Conculsion: There is not much I can do from here. Other than try to find the gps points that are outside of San Francisco which is wrong. But it seems like a small population therefore, not gonna worry about it. Imputation will be done with Street Address.

  • Only hand written tickets has no gps.
  • There are a few gps coordinates that are outside of san francisco like in the oceans or in salt lake city. But they are few.
  • For the exact same address there are missing gps data
  • For the exact same address with gps coordinates, they can vary highly in terms of lat and lon.
  • Between different map service, some of the address differ. Like 3699 YACHT RD is on google but it is 3699 LYON STREET on mapquest.

tick_street_no, tick_street_name, tick_street_suffix

In [177]:
print('Number of nan:')
for i in ['tick_street_no', 'tick_street_name', 'tick_street_suffix']:
    print('{} : {}'.format(i, df_merge[i].isna().sum()))
Number of nan:
tick_street_no : 35680
tick_street_name : 915
tick_street_suffix : 189779
In [178]:
## For string columns, replace nan with nothing because
## nan + string = nan which is bad
df_merge.tick_street_name.fillna(value = '', inplace = True)
df_merge.tick_street_suffix.fillna(value = '', inplace = True)

print('Number of After nan:')
for i in ['tick_street_no', 'tick_street_name', 'tick_street_suffix']:
    print('{} : {}'.format(i, df_merge[i].isna().sum()))
Number of After nan:
tick_street_no : 35680
tick_street_name : 0
tick_street_suffix : 0

Quick fix of tick_street_suffix

In [179]:
## Fixing misspelling
df_merge.tick_street_suffix.replace(to_replace = 'BLD', value = 'BLVD', inplace = True)
df_merge.tick_street_suffix.replace(to_replace = 'RD', value = 'ROAD', inplace = True)
## Make things easier to search
df_merge.tick_street_suffix.replace(to_replace = 'SQ', value = 'SQUARE', inplace = True)
## Only 3 rows and is suppose to be DRIVE
df_merge.tick_street_suffix.replace(to_replace = 'WALK', value = 'DRIVE', inplace = True)
## Has 12 rows but their street names are majority missing
df_merge.tick_street_suffix.replace(to_replace = 'BLOCK', value = '', inplace = True)
In [180]:
## Summary of suffix 
df_merge.tick_street_suffix.value_counts(dropna = False)
Out[180]:
ST        2859519
AVE        696763
           189791
BLVD        70640
DRIVE       55973
TERR        10591
COURT        9266
HWY          6728
PLACE        6101
PARK         4705
CIR          4216
ROAD         3693
LANE         3324
LOT          1684
PLAZA          31
ALLEY          30
SQUARE          8
Name: tick_street_suffix, dtype: int64

Fixing tick_street_name

In [181]:
## Total number of unique streets
print('Number of unique street names: {}'.format(df_merge.tick_street_name.unique().shape[0]))
Number of unique street names: 14570
In [182]:
## https://data.sfgov.org/Geographic-Locations-and-Boundaries/Street-Names/6d9h-4u5v
## Pulled on January 22, 2018, official san francisco street names
off_street_names = pd.read_csv(os.path.join('.', 'dataset', 'Street_Names.csv'))
## Remove unknown streets
off_street_names = off_street_names[~off_street_names.StreetName.str.contains('UNNAMED')]
## Remove streets with TI ending like 6TH AVE TI
off_street_names = off_street_names[~off_street_names.StreetName.str.contains(' TI')]
## Check percentage of inconsistent input 
off_street_names.head(2)
Out[182]:
FullStreetName StreetName StreetType PostDirection
0 01ST ST 01ST ST NaN
2 02ND AVE 02ND AVE NaN
In [183]:
## Match tick_street_name with official san francisco street names
## count the matches and divide by total tickets
print('Percent of Correct Street Name Input: {:0.2f}%'\
      .format(df_merge.tick_street_name.isin(off_street_names.StreetName).sum()/
             df_merge.shape[0] * 100))
Percent of Correct Street Name Input: 95.29%
In [184]:
## Remove non alphanumerics and then end spaces
df_merge.tick_street_name = df_merge.tick_street_name.str.replace('[^a-zA-Z0-9\s]', '').str.strip()
print('After: {}'.format(df_merge.tick_street_name.unique().shape[0]))
After: 14422
In [185]:
## Replace misspelling
def str_rep(to_replace, replace, regex = False):
    df_merge.tick_street_name.replace(to_replace = to_replace, 
                                      value = replace, inplace = True,
                                      regex = regex)
In [186]:
## BLK, BLK OF (A lot of streetnames has this in front)
str_rep(['BLK OF ', 'BLOCK OF '], '', regex = True)
str_rep(['BLK ', 'BLOCK ', 'BK '], '', regex = True)
## LAPU LAPU was LAPU-LAPU before - was taken out
str_rep('LAPULAPU', 'LAPU LAPU', regex = True)
## OFARRELL
str_rep(['O FARRELL', 'OFARRELL', "O'FARRELL", 
         'O"FARRELL', "C O'FARRELL", 'FARRELL', 
         'OFARREL', 'OFFARRELL'], 'OFARRELL')
## EMBARCADERO
str_rep(['EMBARCADERO NORTH', 'EMBARCADERO SOUTH', 'EMBARCADERO'], 'THE EMBARCADERO')
## PIER 43
str_rep('PIER431', 'PEIR 43')
## BAYSHORE
str_rep(['BAY SHORE'], 'BAYSHORE')
## 01ST
str_rep(['1ST'], '01ST')
## 02ND
str_rep(['2ND'], '02ND')
## 03RD
str_rep(['3RD', '3 RD'], '03RD')
## 04TH
str_rep(['4TH'], '04TH')
## 05TH
str_rep(['5TH'], '05TH')
## 06th
str_rep(['6TH'], '06TH')
## 07Th
str_rep(['7TH'], '07TH')
## 8TH
str_rep(['8TH'], '08TH')
## 09th
str_rep(['9TH'], '09TH')
## 16th 
str_rep(['16'], '16TH')
## 25TH
str_rep(['25TH NORTH'], '25TH')
## JOHN F KENNEDY
str_rep(['JFK'], 'JOHN F KENNEDY', regex = True)
## MARTIN LUTHER KING JR
str_rep(['MLK JR', 'MLK DR', 'MLKD', 'MLKF'], 'MARTIN LUTHER KING JR', regex = True)
str_rep('MLK', 'MARTIN LUTHER KING JR', regex = True)
str_rep(['MARTIN LUTHER KING'], 'MARTIN LUTHER KING JR')
## DR CARLTON B GOODLETT
str_rep(['CARLTON B GOODLETT', 'CARLTON B GOODLET', 
         'CARLTON B. GOODLETT', 'DR CARLTON B. GOODLETT',
         'CARLTONN B GOODLETT'], 'DR CARLTON B GOODLETT')
## BUENA VISTA AVE EAST
str_rep('BUENA VISTA AVE  EAST', 'BUENA VISTA AVE EAST')
## GREAT HIGHWAY
str_rep(['GREAT HIGHWAY', 'GRT HWY'], 'GREAT HWY', regex = True)
str_rep('GRT', 'GREAT HWY')
str_rep(['LOWER GREAT HIGHWAY', 'LOWER GREAT HWY', 'GRT HYW'], 'GREAT HWY')
str_rep(['GRT HWY 1000', 'GREAT HWY 1000'], '1000 GREAT HWY')
## JUNIPERO SERRA BLVD
str_rep(['JUNIPERO SERRA  BLVD OFF', 'JUNIPERO SERRA BLVD OFF R', 
         'JUNIPERO SERRA BLVD ON RA', 'JUNIPERO', 
         'JUNIPERO SERRA  BLVD ON R'], 'JUNIPERO SERRA BLVD')
## CECILIA
str_rep('CECELIA', 'CECILIA')
## NORTHPOINT
str_rep('NORTHPOINT', 'NORTH POINT')
## SERGEANT JOHN V YOUNG
str_rep('SERGEANT JOHN V YOUNG', 'SGT JOHN V YOUNG')
## SUNVIEW DRIVE
str_rep('SUNVIEW WAY', 'SUNVIEW DRIVE')
## CALIFORNIA
str_rep('CALIFORINA', 'CALIFORNIA')
## AVENUE OF THE PALMS
str_rep('AVE OF THE PALMS', 'AVENUE OF THE PALMS')
## BUCHANAN
str_rep('BUCHANNAN', 'BUCHANAN')
## PALOV
str_rep('PALOV', 'PALOU')
## POTRERO
str_rep('POTERO', 'POTRERO', regex=True)
## LAGUNA HONDA BLVD
str_rep('LAGUNA HONDA BVLD', 'LAGUNA HONDA BLVD')
## CHRISTMAS TREE
str_rep(['CHRISTMAS TREE', 'CHRISTMAS TREE PT'], 'CHRISTMAS TREE POINT')
## WEST VIEW
str_rep('VIEW', 'WEST VIEW')
## FERRY BLDG EMBARCADERO NORTH
str_rep('FERRY BLG EMBARCADERO NORTH', 'FERRY BLDG EMBARCADERO NORTH')
##
str_rep('FERRY BLG EMBARCADERO SOUTH', 'FERRY BLDG EMBARCADERO SOUTH')
## LOT BUS STOP
str_rep(['LOT BUS', 'LOT BUS ONLY', 
         'LOT BUS RED', 'LOT RED AND BUS ZONE'], 'LOT BUS STOP')
## LINCOLN WAY
str_rep('LINCOLN WY', 'LINCOLN WAY')
## GEARY BLVD
str_rep('GEARY BVLD', 'GEARY BLVD')
str_rep('GERAY', 'GEARY')
## MIDDLEPOINT
str_rep('MIDDLEPOINT', 'MIDDLE POINT', regex = True)
## STOW LAKE
str_rep('STOWLAKE', 'STOW LAKE')
## EL DORADO NORTH
str_rep('EL DORADO NORTH', 'EL DORADO')
## MOSCOW ST
str_rep('MOSCOW CROCKER AMAZON', 'MOSCOW ST')
## MIDDLE WEST DRIVE
str_rep(['MIDDLE DRIVE WEST', 'MIDDLE DR WEST'], 'MIDDLE WEST DRIVE')
## BLYTHDALE
str_rep('BLYTHEDALE', 'BLYTHDALE')
## 
str_rep('SAN BURNO', 'SAN BRUNO')
## MARY TERESA
str_rep('MARY THERESA', 'MARY TERESA')
## BUENA VISTA EAST AVE
str_rep('BUENA VISTA AVE EAST', 'BUENA VISTA EAST')
## BUENA VISTA WEST
str_rep('BUENA VISTA AVE WEST', 'BUENA VISTA WEST')
## CONSERVATORY DR EAST
str_rep('CONSERVATORY DR EAST', 'CONSERVATORY EAST DR')
str_rep('CONSERVATORY DR WEST', 'CONSERVATORY WEST DR')
## Mission Bay Blvd North
str_rep(['MISSION BAY BLVD (NORTH)', 'MISSION BAY BLVD  NORTH', 
         'MISSION BAY BLVD NORTH'], 'MISSION BAY NORTH BLVD')
## Mission Bay Blvd South
str_rep(['MISSION BAY BLVD (SOUTH)', 'MISSION BAY BLVD SOUTH'], 
        'MISSION BAY SOUTH BLVD')
## FERRY BUILDING EMBARCADERO SOUTH
str_rep(['FFERRY BLG EMBARCADERO SOUTH', 'FERRY EMBARCADERO SOUTH', 
         'FERRY BLD EMBARCADERO SOUTH'], 'FERRY BLDG EMBARCADERO SOUTH')
## FERRY BLDG EMBARCADERO NORTH
str_rep(['FERRY BUILDING', ' FERRY BLD EMBARCADERO NORTH', 
         'FFERRY BLG EMBARCADERO NORTH', 'FFERRY BLD EMBARCADERO NORTH', 
         'FERRY BUILDING AND EMBARCADERO'], 'FERRY BLDG EMBARCADERO NORTH')
## YATCH ROAD
str_rep('YACH', 'YACHT')
str_rep(['YACHT RD SAN FRANCISCO', 'YATCH RD MARINA', 
         'YACHT RD MC', 'YACHT RD AND MARINA', 
         'YACHT RD MARINA GREEN PLOT WES'], 'YACHT ROAD')
## MCALLISTER
str_rep('MC ALLISTER', 'MCALLISTER')
## UNITED NATIONS
str_rep(['UNITED NATION', 'UN'], 'UNITED NATIONS')
In [187]:
## Fixing Pier specifically
## Stop when count was at 3. So 4 <= corrected
## This leaves ~ 132 rows uncorrected and ~99 unique street names
str_rep(['PIER 1 EMBARCADERO NORTH', 'PIER 01 EMBARCADERO NORTH', 
         'PIER1 EMBARCADERO NORTH', 'PIER1 1-2 EMBARCADERO NORTH'
         'PIER 1 1-2 EMBARCADERO NORTH', 'PIER EMBARCADERO NORTH',
         'PIER 1-1', 'PIER 1 EMBARCADERO SOUTH', 'PIER 1'], 'PIER 01')

str_rep(['PIER 02 EMBARCADERO SOUTH', 'PIER 02 EMBARCADERO NORTH'], 'PIER 02')

str_rep(['PIER 03 EMBARCADERO NORTH', 'PIER 03 EMBARCADERO SOUTH', 
         'PIER3 EMBARCADERO NORTH', 'PIER 3 EMBARCADERO NORTH', 
         'PIER03 EMBARCADERO NORTH', 'PIER 3'], 'PIER 03')

## PIER 05
str_rep(['PIER 05 EMBARCADERO NORTH', 'PIER 5 EMBARCADERO NORTH',
         'PIER5 EMBARCADERO NORTH', 'PIER 0 5'], 'PIER 05')

str_rep(['PIER 07 EMBARCADERO NORTH'], 'PIER 07')

str_rep(['PIER 09 EMBARCADERO NORTH'], 'PIER 09')

str_rep(['PIER 15 EMBARCADERO NORTH'], 'PIER 15')

str_rep(['PIER 17 EMBARCADERO NORTH'], 'PIER 17')

str_rep(['PIER 23 EMBARCADERO NORTH'], 'PIER 23')

str_rep(['PIER 24 EMBARCADERO SOUTH'], 'PIER 24')

str_rep(['PIER 26 EMBARCADERO SOUTH'], 'PIER 26')

str_rep(['PIER 27 EMBARCADERO NORTH'], 'PIER 27')

str_rep(['PIER 28 EMBARCADERO SOUTH', 'PIER 28 EMBARCADERO NORTH', 
         'PIER 28 EMBARCADERO NORTH'], 'PIER 28')

str_rep(['PIER 30 EMBARCADERO SOUTH', 'PIER30 EMBARCADERO SOUTH'], 'PIER 30')

str_rep(['PIER 31 EMBARCADERO NORTH'], 'PIER 31')

str_rep(['PIER 33 EMBARCADERO NORTH', 'PIER 33 EMBARCADERO', 
         'PIER 33 EMBARCADERO SOUTH', 'PIER 33 AND EMBARCADERO', 
         'PIER 33 AND HALF', 'PIER33 EMBARCADERO NORTH'], 'PIER 33')

str_rep(['PIER 35 EMBARCADERO NORTH'], 'PIER 35')

str_rep(['PIER 39 THE EMBARCADERO', 'PIER39 EMBARCADERO NORTH'], 'PIER 39')

str_rep(['PIER41 EMBARCADERO NORTH', 'PIER41 THE EMBARCADERO', 
         'PIER 41 THE EMBARCADERO', 'PIER 41 EMBARCADERO NORTH'], 'PIER 41')

str_rep(['PIER43 EMBARCADERO NORTH', 'PIER 43 EMBARCADERO NORTH', 
         'PIER 43 THE EMBARCADERO', 'PIER43 THE EMBARCADERO', 
         'PIER43 1-2 EMBARCADERO NORTH', 'PIER 431', 'PEIR 43'], 'PIER 43')

str_rep(['PIER 45 THE EMBARCADERO', 'PIER45 THE EMBARCADERO', 
         'PIER 45 EMBARCADERO NORTH'], 'PIER 45')

str_rep(['PIER 52LOT TERRY A FRANCOIS'], 'PIER 52')
In [188]:
## Manually Input GPS
str_rep(['YACHT RD MARINA GREEN P', 'YACHT RD M', 
         'YACHT RD MG', 'YACHT RD MARINA GREEN PARKING', 
         'YACHT HARBOR', 'YACHT RD MARINA GREEN', 
         'YACHT RD AND MARINA GREEN P', 'YACHT RD MARINA GREEN PLOT EAS', 
         'YACHT RD AND MARINA GRN', 'YACHT RD  MARINA',
         'YACHT RD MARINA GREEN PLOT'], 'YACHT RD MARINA')
## for 110 YACHT ROAD
df_merge.loc[df_merge.tick_street_name == 'YACHT RD MARINA', 'hh_gpslat'] = 37.806935
df_merge.loc[df_merge.tick_street_name == 'YACHT RD MARINA', 'hh_gpslon'] = -122.447542
df_merge.loc[df_merge.tick_street_name == 'YACHT RD MARINA', 'imputed'] = True
## Match with the other addresses
str_rep('YACHT RD MARINA', 'YACHT ROAD')

## (#count) ALL nan
### (#66) CATALINA AND FAIRFAX = 37.736284, -122.380847
## FAIRFAX has 320 while CATALINA 76 tickets, but FAIRFAX is 6x longer
## address is 209 Fairfax and 221 Catalina
df_merge.loc[df_merge.tick_street_name == 'CATALINA AND FAIRFAX', 
             'hh_gpslat'] = 37.736284
df_merge.loc[df_merge.tick_street_name == 'CATALINA AND FAIRFAX', 
             'hh_gpslon'] = -122.380847
df_merge.loc[df_merge.tick_street_name == 'CATALINA AND FAIRFAX',
             'imputed'] = True
df_merge.loc[df_merge.tick_street_name == 'CATALINA AND FAIRFAX', 
             'tick_street_name'] = '221 CATALINA'
## (#69) JOHN F KENNEDY AND GREAT HWY = 37.769196, -122.510987
## it is 1000 GREAT HWY
df_merge.loc[df_merge.tick_street_name == 'JOHN F KENNEDY AND GREAT HWY', 
             'hh_gpslat'] = 37.769196
df_merge.loc[df_merge.tick_street_name == 'JOHN F KENNEDY AND GREAT HWY', 
             'hh_gpslon'] = -122.510987
df_merge.loc[df_merge.tick_street_name == 'JOHN F KENNEDY AND GREAT HWY',
             'imputed'] = True
df_merge.loc[df_merge.tick_street_name == 'JOHN F KENNEDY AND GREAT HWY', 
             'tick_street_name'] = '1000 GREAT HWY'
## (#35) MARTIN LUTHER KING JR AND LINCOLN = 37.764121, -122.508909
df_merge.loc[df_merge.tick_street_name == 'MARTIN LUTHER KING JR AND LINCOLN', 
             'hh_gpslat'] = 37.764121
df_merge.loc[df_merge.tick_street_name == 'MARTIN LUTHER KING JR AND LINCOLN', 
             'hh_gpslon'] = -122.508909
df_merge.loc[df_merge.tick_street_name == 'MARTIN LUTHER KING JR AND LINCOLN',
             'imputed'] = True
df_merge.loc[df_merge.tick_street_name == 'MARTIN LUTHER KING JR AND LINCOLN', 
             'tick_street_name'] = '4750 LINCOLN'
## refresh gps_missing
df_merge['gps_missing'] = df_merge.hh_gpslat.isna()
In [189]:
## Adding Suffix to be consistent
## There is only one options for these that is why I can do this
df_merge.loc[df_merge.tick_street_name == 'CROCKER AMAZON', 'tick_street_suffix'] = 'PARK'
## SAHFTER
df_merge.loc[df_merge.tick_street_name == 'SAHFTER', 'tick_street_suffix'] = 'AVE'
## THORTON
df_merge.loc[df_merge.tick_street_name == 'THORTON', 'tick_street_suffix'] = 'AVE'
## LOT BUS STOP
df_merge.loc[df_merge.tick_street_name == 'LOT BUS STOP', 'tick_street_suffix'] = ''
## CATALINA
df_merge.loc[df_merge.tick_street_name == 'CATALINA', 'tick_street_suffix'] = 'ST'
## COLUMBIA
df_merge.loc[df_merge.tick_street_name == 'COLUMBIA', 'tick_street_suffix'] = 'SQUARE'

LOOK FOR STREET NAME THAT DOESN'T MATCH WITH OFFICIAL OR APPROVED STREET NAMES

In [190]:
## Bin unique names together
street_counts = df_merge.tick_street_name.value_counts().reset_index(name = 'counts')
street_counts.columns = ['street', 'counts'] ## rename columns
print('Number of unique streetnames after: {}'.format(street_counts.shape[0]))
Number of unique streetnames after: 13558
In [191]:
## Determine which streets does not exist 
## Gather all the names because the naming in the dataset is inconsistent 
streetname = off_street_names.StreetName
streetname = streetname.append(off_street_names.FullStreetName, ignore_index = True)
## Our dataset name in this format: Street Direction Suffix
tmp = off_street_names.StreetName + ' ' + off_street_names.PostDirection
## nan + String == nan so got to remove nan
streetname = streetname.append(tmp[tmp.notna()], ignore_index = True)
In [192]:
## manual input of approved streetname values
tmp = ['S', 'BAYSHORE', 'PARKLOT 2450 CALIFORNIA', 'PARKLOT 3252 PIERCE', 'PARKLOT 421 18TH', 
       'PARKLOT 5732 GEARY', 'PARKLOT 324 8TH', 'PARKLOT 1275 20TH', 'PARKLOT 1325 8TH', 
       'PARKLOT 807 ULLOA', 'PARKLOT 4116 18TH', 'YACHT ROAD', 'PARKLOT 457 CASTRO', 
       'PARKLOT 174 WEST PORTAL', 'PARKLOT 330 9TH', 'PARKLOT 20 NORTON', 'PARKLOT 985 POTRERO',
       'PARKLOT 4061 24TH', 'PARKLOT 3255 24TH', 'LECH WALESA', 'PARKLOT 25 FELTON', 
       'PARKLOT 1340 7TH', 'FERRY BLDG EMBARCADERO SOUTH', 'FERRY BLDG EMBARCADERO NORTH',
       'SUNVIEW DRIVE', 'FERRY BLG EMBARCADERO NORTH', 'VIEW', 'CROCKER AMAZON', 'SAHFTER', 
       'SGT JOHN V YOUNG', 'LOT BUS STOP', 'PALACE OF FINE ARTS', 'MARY', 'LOT RED ZONE', 
       'PARKLOT 2500 OCEAN', 'THORTON', 'AVE B', 'ST FRANCIS AVE', 'ST MARYS AVE',
       'ST FRANCIS BLVD', 'ST CHARLES AVE', 'ST ELMO WAY', 'MISSION BAY NORTH BLVD',
       'MISSION BAY SOUTH BLVD', 'COLUMBIA', 'LAPU LAPU', 'LAPU-LAPU', '1000 GREAT HWY', 
       '221 CATALINA', 'PARKLOT 3000 OCEAN', 'UNITED NATIONS']
## Add in the PIERs
tmp = tmp + ['PIER ' + str(i).rjust(2,'0') for i in range(1,53)]
## Then concate all values into streetname
streetname = streetname.append(pd.Series(tmp), ignore_index = True)
print('Total number of recognized street_name: {}'.format(streetname.shape[0]))
Total number of recognized street_name: 5232
In [193]:
## Which street matched with official street naming
match_street = street_counts.street.isin(streetname)
## Manually Look at mismatch streets
print('Total rows not recognized: {}'.format(street_counts[~match_street].counts.sum()))
street_counts[~match_street].to_csv('dataset/unmatched_street.csv')
## Street names that were recognized 
street_counts[match_street].to_csv('dataset/match_street.csv')
print('Percent rows recognized: {:0.2f}%'.format(street_counts[match_street].counts.sum()/
                                           df_merge.shape[0] * 100))
Total rows not recognized: 24655
Percent rows recognized: 99.37%

Using tableau to fix 3 streets

In [194]:
## Trying to understand streetname with structure: ###ST 
tmp = street_counts[~match_street]
nnnst = tmp[tmp.street.str.contains('^[0-9]+ST')]
print(nnnst.shape)
print('Total number of rows before: {}'.format(nnnst.counts.sum()))
## Only care about streetnames > 5
nnnst = nnnst.query('counts > 5').copy()
print(nnnst.shape)
print('Total number of rows after: {}'.format(nnnst.counts.sum()))
(389, 2)
Total number of rows before: 4238
(176, 2)
Total number of rows after: 3742
In [195]:
## Obtain rows with those streetnames and gps is not missing
df_nnnst = df_merge[(~df_merge.gps_missing) & df_merge.tick_street_name.isin(nnnst.street)]
df_nnnst.shape
df_nnnst.to_csv('dataset/tableau/street_nnnST.csv') ## export for tableau visualization

There wasn't a solid pattern where I could automatically fix them like using a street suffix of street_name range.

In [196]:
## Manually grouped clusters together. They were obvious because they are neatly on the street
Image(filename = os.path.join('.','image','cleaning_nnnst.png'), width = 500, height = 100)
Out[196]:
In [197]:
## Import manually grouped streets which was exported from tableau 
nnnst_tableau = pd.read_csv(os.path.join('.','dataset','tableau', 'street_nnnST_tableau_export.csv'))
print('Before: {}'.format(nnnst_tableau.shape[0]))
## throw away rows that are other (unclassified)
nnnst_tableau = nnnst_tableau[nnnst_tableau['Ticket Number (group)'] != 'Other']
print('Number of rows to be fixed: {}'.format(nnnst_tableau.shape[0]))
Before: 2392
Number of rows to be fixed: 1507
In [198]:
## Replace tick_street_name using index of the tickets
nnnst_idx = nnnst_tableau.F1 ## index of ticket to fix
## index for column name
street_idx = df_merge.columns.get_loc('tick_street_name')
cor_street = nnnst_tableau['Ticket Number (group)'].unique()
## Verify things are working correctly
print(cor_street)
print('Before replacement: {}'\
      .format(df_merge.iloc[nnnst_idx, street_idx].isin(cor_street).sum()))
## Replace ticket street name
df_merge.iloc[nnnst_idx, street_idx] = nnnst_tableau['Ticket Number (group)'].values
## Remove any suffix if there are any
df_merge.iloc[nnnst_idx, df_merge.columns.get_loc('tick_street_suffix')] = ''
## Verify replacement is small number as to be fixed rows
print('After replacement: {}'\
      .format(df_merge.iloc[nnnst_idx, street_idx].isin(cor_street).sum()))
['ST FRANCIS AVE' 'ST MARYS AVE' 'ST FRANCIS BLVD' 'ST CHARLES AVE'
 'ST ELMO WAY']
Before replacement: 0
After replacement: 1507
In [199]:
## wrap up
del nnnst, df_nnnst, nnnst_tableau, nnnst_idx, street_idx, cor_street
del streetname, match_street, street_counts

Random notes:

  • Some streets are harder to correct because they might have the similar street names but differ by just the suffix like ST or AVE. So when a row has no suffix and no gps coordinate. There is no way to guess what street it actually is.
  • 2ST is points to a bunch of different places like St Charles Ave, St Mary Ave, St Joseph's Ave
  • 110 YACHT RD MARINA is most likely 110 Yacht rd because of green zone marina violation. 37.806935 -122.447542
  • Lech Walesa isn't a street based on the street name list I got, but google map says otherwise
  • Not Sure what POINT is. Need to go further in detail
  • ???ST format streetname
    • 316ST is most likely 316 St Joseph's Ave after testing 5 GPS coordinates
    • 319ST is most likely 319 St Joseph's Ave after testing 5 GPS coordinates
    • 250ST is most likely 250 St Joseph's Ave after testing 5 GPS coordiantes
    • 16ST is most likely 16 St Joseph's Ave after testing 5 GPS coordinates
    • 100ST is most likely 100 St Joseph's Ave, but after testing 9 GPS coordinates, 1 was at St Charles and another at St Elmo Way
    • ???ST is around 1700 rows. Manually Corrected them.
  • LOT BUS STOP and the others has the same violation codes so are the same
  • 0 checked 3 gps and they are all over the place
  • CROCKER AMAZON Park address is 799 Moscow
  • MARTIN LUTHER KING JR AND LINCOLN is most likely 4798 Lincoln Way because you cannot even park on MLK
  • LOT RED ZONE is all over the map
  • MARY THERESA which is MARY TERESA has 12 gps coord but out of them, 4 of them are totally off.
  • suffix ST are usually downtown like 2nd St and Ave are usually in sunset like 2nd Ave
  • Some time directional position is important like BUENA VISTA AVE EAST/WEST vs MISSION BAY BLVD NORTH/SOUTH which are just a few yards apart from each other
  • Changed BUENA VISTA AVE WEST (actual correct spelling) to BUENA VISTA WEST because it is just easier to handle with setup of street_no + street_name + street_sufix
  • FERRY BLDG EMBARCADERO NORTH and FERRY BLDG EMBARCADERO SOUTH are basically the same location but differ by like 100 ft. It is just one end of the building vs the other. Keep as is?
  • There is an ISIS street
  • AVE B is on Treasure Island
  • Geary ST becomes Geary Blvd after Van ness
  • COLUMBIA is actually COLUMBIA SQUARE ST. sometime the suffix is square and st.
  • 607 TERRACE WALK has gps coordinates that is somewher else. Maybe it is because this address doesn't exist since TERRACE DRIVE street number goes up to 199.
In [200]:
## Strip white space from street_name and street_suffix
df_merge.tick_street_name = df_merge.tick_street_name.str.strip()
df_merge.tick_street_suffix = df_merge.tick_street_suffix.str.strip()

Fixing tick_street_no

In [201]:
df_merge.tick_street_no.describe(percentiles=[.75, .95, .99]).astype('int')
Out[201]:
count    3887383
mean        1149
std         1259
min            1
50%          712
75%         1711
95%         3476
99%         4856
max        99935
Name: tick_street_no, dtype: int32
In [202]:
df_merge.tick_street_no.plot.box()
plt.ylabel('Street Number')
plt.title('The range of Street Number')
plt.show()

Longest Street/Highest Street Number in San Francisco

  • Mission street - Goes up to ~ 6000
  • Geary Blvd - Goes up to ~ 8500
  • Thrid Street - Goes up to ~ 6700
  • California Street - Goes up to ~ 6000
  • Alemany Boulevard - Goes up to ~ 4000
  • Market Street - Goes up to ~ 4000
In [203]:
## Look for outliers
street_outlier = df_merge.tick_street_no > 8500
print('Number of outlier: {:,}'.format(street_outlier.sum()))

df_merge.tick_street_no[street_outlier].plot.hist()
plt.title('Closer look at outliers greater than 8500')
plt.show()
Number of outlier: 1,477
In [204]:
## Check A random sample and found that street number is
## off by a factor of 10
## 16999.0 BRODERICK 37.784932 -122.441568
## 8511.0 GOLDEN GATE 37.780462 -122.424600
## 29912.0 LAWTON 37.757248 -122.494125
## 69900.0 CALIFORNIA 37.783620 -122.491630
## 34099.0 MISSION 37.742153 -122.422218
df_merge.loc[street_outlier, ['tick_street_no', 'tick_street_name',
                              'hh_gpslat', 'hh_gpslon']].head()
Out[204]:
tick_street_no tick_street_name hh_gpslat hh_gpslon
456 24470.0 VAN NESS 37.798245 -122.424038
2755 29077.0 24TH 37.752730 -122.410423
3089 24380.0 WASHINGTON 37.791678 -122.434265
3688 16044.0 SANCHEZ 37.743260 -122.428958
13414 23390.0 IRVING 37.763307 -122.483357
In [205]:
## Remove a factor of 10
df_merge.loc[street_outlier, 'tick_street_no'] = df_merge.loc[street_outlier, 'tick_street_no']//10
In [206]:
## Refresh outlier list
street_outlier = df_merge.tick_street_no > 8500
print('Possible still outliers: {}'.format(street_outlier.sum()))
Possible still outliers: 22
In [207]:
## View the last few outliers
## It looks like the street_no are basically all wrong
## 9197.0 ALEMANY 37.720790 -122.439123 is actually 1979
## 9482.0 GEARY 37.780898 -122.465185 is actually 4315
## 8565.0 16TH 37.765213 -122.417205 is actually 2878
## 9828.0 COLLINGWOOD 37.757708 -122.435867 is actually 201
## 9769.0 40TH 37.760795 -122.499570 is actually 1391
df_merge.loc[street_outlier, ['tick_street_no', 'tick_street_name',
                              'hh_gpslat', 'hh_gpslon']].head()
## I will leave them as is. Since small population and
## no easy way. 
Out[207]:
tick_street_no tick_street_name hh_gpslat hh_gpslon
98596 9122.0 BERRY 37.775288 -122.393870
271443 8527.0 MIRAMAR 37.723757 -122.458235
563719 9563.0 ALABAMA 37.760592 -122.412053
600157 8660.0 45TH 37.777087 -122.505975
903897 9482.0 GEARY 37.780898 -122.465185

Create Full Address of ticket

Combine street_name + street_suffix

Sometime street_name will contain suffix and sometime street_suffix would be nan which is important for some street where there same street_names

In [208]:
## make sure there are no nan
print(df_merge.tick_street_name.isna().sum(), 
      df_merge.tick_street_suffix.isna().sum())
0 0
In [209]:
# ## Summarize tickets by street name and suffix
# tmp = df_merge.groupby(['tick_street_name', 'tick_street_suffix'])\
#     .agg({'hh_gpslat':['size', 'count'], 'handheld': 'sum'})
# tmp.columns = ['total', 'n_gps', 'n_handheld']
# tmp.reset_index(inplace = True)

# ## Get the total tickets by street name
# tmp2 = df_merge.groupby('tick_street_name')\
#     .size().to_frame(name = 'max').reset_index()\
#     .sort_values(by = 'max', ascending = False)
    
# ## Combine them together. Sort by total ticket by street name
# tmp3 = pd.merge(tmp, tmp2, how = 'left', on = 'tick_street_name')\
#     .sort_values(by='max', ascending = False)

# tmp3.to_csv('./dataset/street_name_suffix_count.csv')
# tmp3.head(7)
# del tmp, tmp2, tmp3 ## wrap up

The main discrepancy is due to the difference between written and handheld tickets. Some of them are obvious because there are only one suffix while some has a few so it will take a lot of effort and seemingly little return.

But it is important to note, that if this is not corrected it will affect hand written tickets the most because they have the highest variance and if gps were to be imputed, they will most likely remain missing. However, hand written tickets are a small population (3.5%).

In [210]:
## Combine street name and suffix and then remove white space if there
## and double space
partial_street = df_merge.tick_street_name.str\
    .cat(df_merge.tick_street_suffix, sep = ' ')\
    .str.strip().str.replace('  ', ' ')
In [211]:
## Fix repeated suffix like ' ST ST' with ' ST'
## Need space in front because 1ST ST might be replaced
fix = [(i*2, i) for i in [' ST', ' AVE', ' BLVD', 
                          ' DRIVE', ' TERR', ' COURT', 
                          ' HWY', ' PLACE', ' PARK', 
                          ' CIR', ' ROAD', ' LANE', 
                          ' LOT', ' PLAZA', ' ALLEY', 
                          ' SQUARE']]
In [212]:
print('Before fixing')
for d, s in fix:
    print('{}: {}'.format(d, partial_street.str.contains(d).sum()))
Before fixing
 ST ST: 16
 AVE AVE: 6
 BLVD BLVD: 0
 DRIVE DRIVE: 0
 TERR TERR: 0
 COURT COURT: 0
 HWY HWY: 132
 PLACE PLACE: 0
 PARK PARK: 0
 CIR CIR: 0
 ROAD ROAD: 11
 LANE LANE: 0
 LOT LOT: 1
 PLAZA PLAZA: 0
 ALLEY ALLEY: 0
 SQUARE SQUARE: 0
In [213]:
print('After fixing')
to_fix = [fix[0], fix[1], fix[6], fix[10], fix[12]]
## Only a few needs to be fixed
for d, s in to_fix:
    partial_street = partial_street.str.replace(d, s)
    print('{}: {}'.format(d, partial_street.str.contains(d).sum()))
After fixing
 ST ST: 0
 AVE AVE: 0
 HWY HWY: 0
 ROAD ROAD: 0
 LOT LOT: 0

street_no + partial_street

In [214]:
## Remove nan and '.0' decimal
tmp = df_merge.tick_street_no.astype('str').str.replace('nan', '')
tmp = tmp.str.replace('\.0', '')
In [215]:
street_full = tmp.str.cat(partial_street, sep = ' ')
street_full.head()
Out[215]:
0    4420 MISSION ST
1       834 47TH AVE
2    2284 MISSION ST
3        568 02ND ST
4     22 CHABOT TERR
Name: tick_street_no, dtype: object
In [216]:
## Add compiled street into df
df_merge['address'] = street_full.str.strip()
df_merge['street_partial'] = partial_street
In [217]:
## wrap up
del street_outlier, partial_street, fix, to_fix, street_full, d, s

Conclusion:

Tick_street_no

  • Corrected some of the outliers that were correct just a factor larger (i.e an extra digit at the end). However there are errors that are correctable like an entirely wrong input of the address. Therefore, the column as majority left as is.
  • Suffix discrepancy are usually from handheld vs handwritten tickets.

Tick_street_name

  • Only ~5% of input does not match official San Francisco street names. The main discrepancy is the use of place name like PIER 03, intersections like MLK and GRT HWY, abbrevivations or including suffix in street_name.
  • Sometime the whole address is inside the this column like 'LEAVENWORTH 1200', '33RD AVE SF CA'
  • There are special symbols in the address
  • List of weird input: 10THG, 001ST, 10 TH, 12 PIER EMBARCADOR NORTH,
  • Imcomplete address like 'CONSERVATORY AND'
  • Name of a place rather than street like 'MANDELL PLAZA' which is spelled wrong (MENDELL*)
  • Unofficial abbreviation like 'UNITED NATIONS' = 'UN'
  • It can be quit confusing sometime because user input are inconsistent like most common is to use the full street address, sometime two street intersection, name of the place like Tea Garden, or nonsensical 350ST (???ST) which have many different streets group togeter.
  • Inconsistent naming like PARKLOT ## OCEAN, OCEAN ## PARKLOT
  • VIN in column 2GCCH31JX148026

Tick_street_suffix

  • Non exisitng suffix like BLOCK
  • Some are missing which is bad because gps coordinates are missing and there are multiple options like 16th ST or 16TH AVE

Impute Missing GPS

In [218]:
def sum_street(x):
    d = {}
    ## total number of tickets
    d['total'] = x.shape[0]
    ## number of tickets with missing coordinates
    d['no_gps'] = x.gps_missing.sum()
    ## number of tickets with gps coordinates
    d['has_gps'] = d['total'] - d['no_gps']
    
    ## percent complete 
    d['pct_complete'] = d['has_gps'] * 100//d['total'] 
    ## default gps for address
    lat = np.nan
    lon = np.nan
    
    ## bypass if gps is all or none missing
    ## because cannot or no need to impute
    if d['pct_complete'] != 0 and d['pct_complete'] != 100:
        lat = x.hh_gpslat.median()
        lon = x.hh_gpslon.median()

    d['median_lat'] = lat
    d['median_lon'] = lon
    return pd.Series(d, index = ['total', 'has_gps', 'no_gps', 
                                 'pct_complete', 'median_lat', 
                                 'median_lon'])
In [219]:
## Create imputation table
address_gps_exact = df_merge.groupby('address').apply(sum_street)
In [220]:
## Calculate cannot be imputed due to zero gps coordinate
print('Number of tickets that cannot be imputed: {:,}'\
      .format(int(address_gps_exact.loc[address_gps_exact.pct_complete == 0, 'no_gps'].sum())))
## Extract the ones that cannot be imputed to be manually fixed
address_gps_exact_fail = address_gps_exact[address_gps_exact.pct_complete == 0]
## Keep rows with at least 100
address_gps_exact_fail = address_gps_exact_fail[address_gps_exact_fail.total > 100]
## Sort for easy visual
address_gps_exact_fail.sort_values(by = 'total', ascending = False, inplace = True)
print('N rows: {}'.format(address_gps_exact_fail.shape[0]))
# address_gps_exact_fail.to_csv('dataset/address_gps_exact_fail.csv')
address_gps_exact_fail.head(3)
Number of tickets that cannot be imputed: 198,209
N rows: 57
Out[220]:
total has_gps no_gps pct_complete median_lat median_lon
address
50 DRUMM 1919.0 0.0 1919.0 0.0 NaN NaN
331 TOWNSEND 1795.0 0.0 1795.0 0.0 NaN NaN
3950 SCOTT ST 1157.0 1.0 1156.0 0.0 NaN NaN
In [221]:
## Remove rows where gps is na which are the 0% and 100% missing address
address_gps_exact.dropna(subset=['median_lat', 'median_lon'], inplace=True)
print('Number of address before dropping: {:,}'.format(address_gps_exact.shape[0]))

## Keep if count is greater than mininum ticket count or completeness
min_tickets = 30
min_complete = 75
min_count_idx = address_gps_exact.has_gps > min_tickets
min_compt_idx = address_gps_exact.pct_complete > min_complete

print('Number of address after dropping: {:,}'.format((min_count_idx | min_compt_idx).sum()))
print('===============')
print('Number of tickets affected by drop: {:,}'\
      .format(int(address_gps_exact.loc[~(min_count_idx | min_compt_idx), 'no_gps'].sum())))

## Drop rows that doesn't meet threshold
address_gps_exact = address_gps_exact[(min_count_idx | min_compt_idx)]

print('Number of tickets captured: {:,}'\
      .format(int(address_gps_exact.no_gps.sum())))
Number of address before dropping: 227,486
Number of address after dropping: 53,333
===============
Number of tickets affected by drop: 793,345
Number of tickets captured: 427,789
Manually fix a few address in lookup table
In [222]:
## gps from google maps and mapquest
## found out that mapquest and google maps gives different 
## gps lat lon for the same address. only slightly different
## google tend to point to buildings while mapquest streets
fix = [('50 DRUMM', (37.795164, -122.396633)),
       ('50 DRUMM ST', (37.795164, -122.396633)),
       ('300 TOWNSEND', (37.777225, -122.394845)),
       ('301 TOWNSEND', (37.776836, -122.395335)),
       ('310 TOWNSEND', (37.776804, -122.395407)),
       ('311 TOWNSEND', (37.776665, -122.395533)),
       ('330 TOWNSEND', (37.776523, -122.395786)),
       ('331 TOWNSEND', (37.776567, -122.395695)),
       ('110 YACHT ROAD', (37.806744, -122.447658)),
       ('99 YACHT ROAD', (37.806924, -122.446578)),
       ('3950 SCOTT ST', (37.806295, -122.442323)),
       ('3950 SCOTT', (37.806295, -122.442323)),
       ## This SF city hall
       ('DR CARLTON B GOODLETT', (37.779416, -122.418441)),
       ## Google pointed to the hosiptal while mapquest to the road
       ('375 LAGUNA HONDA', (37.747911, -122.458602)),
       ('375 LAGUNA HONDA BLVD', (37.747911, -122.458602)),
       ('780 MARKET', (37.786022, -122.405583)),
       ('301 MARKET', (37.792422, -122.397442)),
       ('1490 MARKET', (37.775373, -122.419027)),
       ('333 MARKET', (37.792060, -122.397885)),
       ('450 GOLDEN GATE', (37.781361, -122.418005)),
       ('1000 VAN NESS', (37.785061, -122.421182)),
       ('3500 VAN NESS', (37.807386, -122.426409)),
       ('800 JAMESTOWN', (37.716293, -122.390101)),
       ('1001 POTRERO', (37.755789, -122.406516)),
       ('350 GIRARD', (37.727979, -122.404876)),
       ('200 MARINA BLVD', (37.806047, -122.435333)),
       ('200 MARINA', (37.806047, -122.435333)),
       ('100 GROVE', (37.778497, -122.418299)),
       ('200 MASON', (37.785332, -122.409528)),
       ('100 JESSIE', (37.788412, -122.400417)),
       ('547 CLEMENT', (37.782924, -122.465226)),
       ('700 JOHN MUIR', (37.709942, -122.488090))]
In [223]:
## Transfer data from list to dataframe
for a, gps in fix:
    address_gps_exact_fail.loc[a, 'median_lat'] = gps[0]
    address_gps_exact_fail.loc[a, 'median_lon'] = gps[1]

## Only keep rows with gps coordinates
address_gps_exact_fail.dropna(inplace = True)
print('Number of rows recovered: {:,}'.format(int(address_gps_exact_fail.no_gps.sum())))

## combine both together
address_gps_exact = pd.concat([address_gps_exact, address_gps_exact_fail])
## No longer needed columns
address_gps_exact.drop(columns=['total', 'has_gps', 'no_gps', 'pct_complete'], inplace=True)
## reset index for merging 
address_gps_exact.reset_index(inplace = True)
print('Final lookup table shape: {}'.format(address_gps_exact.shape))
Number of rows recovered: 14,271
Final lookup table shape: (53363, 3)
In [224]:
## Plot the imputations
pt = address_gps_exact
gps = gpd.GeoDataFrame(pt[['address']], crs ={'init': 'epsg:4269'},
                       geometry = [Point(xy) for xy in zip(pt.median_lon, pt.median_lat)])
    ## the base map (SF)
base = sf.plot(color='white', edgecolor='black', figsize=(30,5))
gps.plot(ax=base, marker='o', color='red', markersize=10)
plt.title('GPS locations inside Exact Lookup table')
plt.show()
In [225]:
## wrap up
del address_gps_exact_fail, fix, min_count_idx, min_compt_idx
del min_tickets, min_complete, a, gps, base

Impute with median gps coordinates from Exact Same Address

In [226]:
## Take the proportion that is gonna be imputed
to_impute = df_merge.loc[df_merge.gps_missing, ['ticket_number', 'address']].copy().reset_index()
print('N rows matches N missing gps: {}'.format(to_impute.shape[0] == df_merge.gps_missing.sum()))
N rows matches N missing gps: True
In [227]:
## Merge with address_gps_exact lookup table
imputed = pd.merge(to_impute, address_gps_exact, how = 'left', on = 'address')
print('# of rows imputed: {:,}'.format(imputed.median_lat.notna().sum()))
# of rows imputed: 442,060
In [228]:
## Drop all failed imputation
imputed.dropna(inplace = True)
## Set index for quick replacement of nan values
imputed.set_index('index', inplace = True)
In [229]:
## Verify that index still match with correct ticket
tmp = imputed.sample(n=20, random_state = 50)
print('Imputed index and ticket number match with df_merge: {}'\
      .format((df_merge.iloc[tmp.index, df_merge.columns.get_loc('ticket_number')] == \
               tmp.ticket_number).all()))
## glimpse at comparison
tmp.head(3)
Imputed index and ticket number match with df_merge: True
Out[229]:
ticket_number address median_lat median_lon
index
1246233 869121805 199 01ST ST 37.788587 -122.396057
3389420 PD31100075 20 JONES ST 37.781450 -122.412157
1715977 865445711 633 KIRKHAM ST 37.760250 -122.467823
In [230]:
## Replace missing with imputed table
df_merge.iloc[imputed.index, df_merge.columns.get_loc('hh_gpslat')] = imputed.median_lat
df_merge.iloc[imputed.index, df_merge.columns.get_loc('hh_gpslon')] = imputed.median_lon
## Record which rows were imputed
df_merge.iloc[imputed.index, df_merge.columns.get_loc('imputed')] = True
## update gps_missing
df_merge['gps_missing'] = df_merge.hh_gpslat.isna()
print('Current missing amount: {:,}'.format(df_merge.gps_missing.sum()))
Current missing amount: 977,283
In [231]:
## wrap up
del tmp, to_impute, imputed, address_gps_exact, pt

Blocking street address to every 100 street number.

Standard blocks are usually increment of 100.

In [232]:
## Need to extract street_no because some tickets has
## full address in the street_name column
street_no_extract = df_merge.address.str.partition(' ')
street_no_extract.head(2)
Out[232]:
0 1 2
0 4420 MISSION ST
1 834 47TH AVE
In [233]:
## flooring street_no to the 2nd digit
street_no_block = pd.to_numeric(street_no_extract[0], errors='coerce') // 100 * 100
## Convert street no to string, remove nan and tailing zero '.0'
street_no_block = street_no_block.astype('str').str.replace('nan', '').str.replace('\.0', '')
## combine street block with street name + suffix and remove white space
df_merge['block'] = street_no_block.str.cat(df_merge.street_partial, sep = ' ').str.strip()
## Verify via visual
df_merge[['address', 'block']].sample(n = 5, random_state = 33)
Out[233]:
address block
858621 2730 39TH AVE 2700 39TH AVE
1724069 3236 JUDAH ST 3200 JUDAH ST
1665359 750 AVALON AVE 700 AVALON AVE
2851903 856 THORNTON AVE 800 THORNTON AVE
3668904 77 COLBY ST 0 COLBY ST
In [234]:
## Create imputation table
address_gps_block = df_merge.groupby('block').apply(sum_street)

## Calculate cannot be imputed due to zero gps coordinate
print('Number of tickets that cannot be imputed: {:,}'\
      .format(int(address_gps_block.loc[address_gps_block.pct_complete == 0, 'no_gps'].sum())))
Number of tickets that cannot be imputed: 78,332
In [235]:
## Extract the ones that cannot be imputed to be manually fixed
address_gps_block_fail = address_gps_block[address_gps_block.pct_complete == 0]
## Keep rows with at least 100
address_gps_block_fail = address_gps_block_fail[address_gps_block_fail.total > 100]
## Sort for easy visual
address_gps_block_fail.sort_values(by = 'total', ascending = False, inplace = True)
print('N rows: {}'.format(address_gps_block_fail.shape[0]))
address_gps_block_fail.to_csv('dataset/address_gps_block_fail.csv')
address_gps_block_fail.head(3)
N rows: 54
Out[235]:
total has_gps no_gps pct_complete median_lat median_lon
block
E LOT 914.0 0.0 914.0 0.0 NaN NaN
860.0 0.0 860.0 0.0 NaN NaN
500 CALIFORNIA 495.0 0.0 495.0 0.0 NaN NaN

Most of the ones that cannot be fix are mainly because of ambiguous street name like California which can be street or avenue or E LOT which doesn't say which parking lot it is.

In [236]:
## Remove rows where gps is na which are the 0% and 100% missing address
address_gps_block.dropna(subset=['median_lat', 'median_lon'], inplace=True)
print('Number of address before dropping: {:,}'.format(address_gps_block.shape[0]))

## Keep if count is greater than mininum ticket count or completeness
min_tickets = 30
min_complete = 75
min_count_idx = address_gps_block.has_gps > min_tickets
min_compt_idx = address_gps_block.pct_complete > min_complete

print('Number of address after dropping: {:,}'.format((min_count_idx | min_compt_idx).sum()))
print('===============')
print('Number of tickets affected by drop: {:,}'\
      .format(int(address_gps_block.loc[~(min_count_idx | min_compt_idx), 'no_gps'].sum())))

## Drop rows that doesn't meet threshold
address_gps_block = address_gps_block[(min_count_idx | min_compt_idx)]

print('Number of tickets captured: {:,}'\
      .format(int(address_gps_block.no_gps.sum())))
Number of address before dropping: 12,148
Number of address after dropping: 9,275
===============
Number of tickets affected by drop: 28,870
Number of tickets captured: 870,081

Manually Fix a few

Many of are just missing its suffix. However, I decided to fix them via manually searching for coordinates instead of adding in the suffix.

In [237]:
## I chose the middle of the block
## Did not do all of the possible ones
fix = [('700 CLEMENT', (37.782829, -122.467150)),
       ('200 COLUMBUS', (37.797519, -122.406152)),
       ('200 MARKET', (37.792908, -122.396852)),
       ('1600 GENEVA', (37.711715, -122.428041)),
       ('400 CASTRO', (37.761642, -122.435136))
      ]
In [238]:
## Transfer data from list to dataframe
for a, gps in fix:
    address_gps_block_fail.loc[a, 'median_lat'] = gps[0]
    address_gps_block_fail.loc[a, 'median_lon'] = gps[1]

## Only keep rows with gps coordinates
address_gps_block_fail.dropna(inplace = True)
print('Number of rows recovered: {:,}'.format(int(address_gps_block_fail.no_gps.sum())))

## combine both together
address_gps_block = pd.concat([address_gps_block, address_gps_block_fail])
## No longer needed columns
address_gps_block.drop(columns=['total', 'has_gps', 'no_gps', 'pct_complete'], inplace=True)
## reset index for merging 
address_gps_block.reset_index(inplace = True)
print('Final lookup table shape: {}'.format(address_gps_block.shape))
Number of rows recovered: 833
Final lookup table shape: (9280, 3)
In [239]:
## Plot the imputations
pt = address_gps_block
gps = gpd.GeoDataFrame(pt[['block']], crs ={'init': 'epsg:4269'},
                       geometry = [Point(xy) for xy in zip(pt.median_lon, pt.median_lat)])
## the base map (SF)
base = sf.plot(color='white', edgecolor='black', figsize=(30,5))
gps.plot(ax=base, marker='o', color='red', markersize=10)
plt.title('GPS locations inside Block Lookup table')
plt.show()
In [240]:
## wrap up
del address_gps_block_fail, fix, min_count_idx, min_compt_idx, a, gps
del min_tickets, min_complete, street_no_block, street_no_extract

Impute with median gps coordinates from Blocking

In [241]:
## Take the proportion that is gonna be imputed
to_impute = df_merge.loc[df_merge.gps_missing, ['ticket_number', 'block']].copy().reset_index()
print('N rows matches N missing gps: {}'.format(to_impute.shape[0] == df_merge.gps_missing.sum()))
N rows matches N missing gps: True
In [242]:
## Merge with address_gps_exact lookup table
imputed = pd.merge(to_impute, address_gps_block, how = 'left', on = 'block')
print('# of rows imputed: {:,}'.format(imputed.median_lat.notna().sum()))
# of rows imputed: 870,914
In [243]:
## Drop all failed imputation
imputed.dropna(inplace = True)
## Set index for quick replacement of nan values
imputed.set_index('index', inplace = True)
In [244]:
## Verify that index still matches with correct ticket
tmp = imputed.sample(n=20, random_state = 50)
print('Imputed index and ticket number match with df_merge: {}'\
      .format((df_merge.iloc[tmp.index, df_merge.columns.get_loc('ticket_number')] == \
               tmp.ticket_number).all()))
## glimpse at comparison
tmp.head(3)
Imputed index and ticket number match with df_merge: True
Out[244]:
ticket_number block median_lat median_lon
index
1561949 852046436 700 FLORIDA ST 37.759720 -122.410895
1250326 869714543 700 AMAZON AVE 37.714984 -122.434211
1783115 873197614 1300 DOLORES ST 37.748092 -122.424770
In [245]:
## Replace missing with imputed table
df_merge.iloc[imputed.index, df_merge.columns.get_loc('hh_gpslat')] = imputed.median_lat
df_merge.iloc[imputed.index, df_merge.columns.get_loc('hh_gpslon')] = imputed.median_lon
## Record which rows were imputed
df_merge.iloc[imputed.index, df_merge.columns.get_loc('imputed')] = True
## update gps_missing
df_merge['gps_missing'] = df_merge.hh_gpslat.isna()
print('Current missing amount: {:,}'.format(df_merge.gps_missing.sum()))
Current missing amount: 106,369
In [246]:
## wrap up
del tmp, to_impute, imputed, address_gps_block
In [247]:
## Ploting all gps points 
sfplot(df_merge)

Try to recover the last bit by fixing the suffix(failed) and geocoding

This method was place at the end because it makes changes on the whole dataset and without a robust way to confirm that it works without any bugs and will not take a long time to complete. I will limit to the last bit which is only very small proportion. Plus streets with clearly wrong suffix will stay missing therefore, the position of this method doesn't matter. I was not able to make this work because it was hard fixing street with common words like MISSION ROCK and MISSION ST.

In [248]:
## Find Street with more than one suffix
street_more_suffix = off_street_names.groupby(['StreetName']).size()\
                        .to_frame(name = 'n_suffix').reset_index()
street_more_suffix = street_more_suffix[street_more_suffix.n_suffix > 1]
print(street_more_suffix.shape)
(146, 2)
In [249]:
## Extract the ones that are still missing
## And using block to impute 
to_impute = df_merge.loc[df_merge.gps_missing, ['ticket_number', 'address', 'block']]\
                .copy().reset_index()
print('N rows matches N missing gps: {}'.format(to_impute.shape[0] == df_merge.gps_missing.sum()))
N rows matches N missing gps: True
In [250]:
## Remove rows that clearly cannot be fixed
## Keep full address and cross streets
to_impute = to_impute[to_impute.block.str.contains(' AND ') |
                      to_impute.block.str.contains('^\d+ ', na=False)].copy()
print(to_impute.shape)
(86820, 4)
In [251]:
print('Before: {:,}'.format(to_impute.shape[0]))
## Remove streets with multiple possible suffix
for i, street in street_more_suffix.StreetName.iteritems():
    ## must match whole street
    match = to_impute.address.str.contains('\\b'+street+'\\b', na=False)
    if match.any():
        to_impute = to_impute[~match]
print('After drop: {:,}'.format(to_impute.shape[0]))
Before: 86,820
After drop: 66,901
In [252]:
## Obtain top # of blocks to geoencode
top = 2400
block_count = to_impute.block.value_counts(ascending=False)
top_block = block_count.index[:top]
print('Number of rows captured: {:,}'.format(block_count[:top].sum()))
Number of rows captured: 44,286
In [253]:
# ## Save geocode result
# geo_cache = pd.DataFrame({'lat': 0.0, 'lon': 0.0, 
#                           'confidence': 0, 'status': 'none'}, 
#                          index = top_block)

# ## Obtain lat and lon with google geocoder api
# ## by looping over the collected address
# for address in geo_cache.index.tolist()[357:]:
#     ## You need an api key or else you will get OVER_QUERY_LIMIT
#     ## on the first query
#     g = geocoder.google(address + ', San Francisco, CA', 
#                         key = 'YOUR-OWN')
#     if g.status == 'OVER_QUERY_LIMIT':
#         print('OVER_QUERY_LIMIT')
#         break
#     ## skip if not okay i.e ZERO_RESULTS
#     if g.status != 'OK':
#         geo_cache.at[address, 'status'] = g.status
#         continue
#     geo_cache.at[address, 'lat'] = g.latlng[0]
#     geo_cache.at[address, 'lon'] = g.latlng[1]
#     geo_cache.at[address, 'confidence'] = g.confidence
#     geo_cache.at[address, 'status'] = g.status

## Save result so I don't need to use geocode again. 
# geo_cache.to_csv('dataset/geo_cache_new.csv')
In [254]:
## read in previously geocoded address
geo_cache = pd.read_csv('dataset/geo_cache_final.csv', index_col = 0)
## Remove fail searches
geo_cache = geo_cache[geo_cache.status == 'OK']
geo_cache.index.name = 'block'
geo_cache.reset_index(inplace = True)
print(geo_cache.shape)
geo_cache.head(2)
(2392, 5)
Out[254]:
block confidence lat lon status
0 1600 CARROLL AVE 9 37.724993 -122.393398 OK
1 100 34TH AVE 9 37.784565 -122.500744 OK
In [255]:
## Determine geocoded points that are outside of boundary 
tmp = geo_cache.reset_index()
gps = gpd.GeoDataFrame(tmp[['index','block']], crs ={'init': 'epsg:4269'},
                       geometry = [Point(xy) for xy in zip(tmp.lon, tmp.lat)])
## Determine if gps coordinates are inside polygon
inOut = sjoin(gps, sf[['NAME', 'geometry']], how = 'left')
print('Number of rows outside of SF: {}'.format(inOut.NAME.isna().sum()))
base = sf.plot(color = 'white', edgecolor = 'black', figsize=(30,5))
## Blue if inside
inOut[inOut.NAME.notna()].plot(ax=base, marker='o', color='blue',markersize=10)
## Red if outside
inOut[inOut.NAME.isna()].plot(ax=base, marker='o', color='red',markersize=10)
plt.title('Geocode Points Inside vs Outside')
plt.show()
Number of rows outside of SF: 16
In [256]:
## set index
inOut.set_index('index', inplace=True)
## find points outside of SF
outside = inOut[inOut.NAME.isna()]
outside
Out[256]:
block geometry index_right NAME
index
47 2700 MISSION POINT (-122.4532692 37.6723342) NaN NaN
227 0 LIEBIG POINT (-122.4553447 37.7079575) NaN NaN
284 400 SCHWERIN ST POINT (-122.4127644 37.70802949999999) NaN NaN
447 600 LARCH POINT (-122.4149428 37.666676) NaN NaN
696 1200 MISSION POINT (-122.4401906 37.6624749) NaN NaN
1349 100 MARIPOSA ST POINT (-122.4007595 37.6842161) NaN NaN
1395 200 DONAHUE ST POINT (-122.5115841 37.8746375) NaN NaN
1812 400 MAPLE POINT (-122.4121579 37.6566735) NaN NaN
1830 100 LA SALLE AVE POINT (-122.2231074 37.81506270000001) NaN NaN
1845 300 KING POINT (-122.4481051 37.6535592) NaN NaN
2040 100 SILVER AVE POINT (-122.368742 37.9574361) NaN NaN
2064 JOHN F KENNEDY AND 9TH AVE POINT (-122.0546361 37.9567523) NaN NaN
2096 600 PINE POINT (-122.4132872 37.6587881) NaN NaN
2119 100 MARIN BLVD POINT (-122.5677263 37.9835875) NaN NaN
2133 200 HUDSON POINT (-122.242269 37.4822003) NaN NaN
2223 0 NICHOLAS WAY POINT (-122.0461558 37.95628789999999) NaN NaN
In [257]:
print('Before: {}'.format(geo_cache.shape[0]))
## Drop the bad points
geo_cache.drop(index=outside.index, inplace=True)
print('After: {}'.format(geo_cache.shape[0]))
Before: 2392
After: 2376
In [258]:
## Merge with address_gps_exact lookup table
imputed = pd.merge(to_impute, geo_cache, how = 'left', on = 'block')
print('# of rows imputed: {:,}'.format(imputed.lat.notna().sum()))
# of rows imputed: 43,835
In [259]:
## Drop all failed imputation
imputed.dropna(inplace = True)
## Set index for quick replacement of nan values
imputed.set_index('index', inplace = True)
In [260]:
## Verify that index still matches with correct ticket
tmp = imputed.sample(n=20, random_state = 50)
print('Imputed index and ticket number match with df_merge: {}'\
      .format((df_merge.iloc[tmp.index, df_merge.columns.get_loc('ticket_number')] == \
               tmp.ticket_number).all()))
## glimpse at comparison
tmp.head(3)
Imputed index and ticket number match with df_merge: True
Out[260]:
ticket_number address block confidence lat lon status
index
3551962 PD27592401 1836 SUNNYDALE 1800 SUNNYDALE 9.0 37.713245 -122.419764 OK
1940027 886312022 1632 CARROLL AVE 1600 CARROLL AVE 9.0 37.724993 -122.393398 OK
2311565 1004545323 228 CHURCH 200 CHURCH 9.0 37.767411 -122.428949 OK
In [261]:
## Replace missing with imputed table
df_merge.iloc[imputed.index, df_merge.columns.get_loc('hh_gpslat')] = imputed.lat
df_merge.iloc[imputed.index, df_merge.columns.get_loc('hh_gpslon')] = imputed.lon
## Record which rows were imputed
df_merge.iloc[imputed.index, df_merge.columns.get_loc('imputed')] = True
## update gps_missing
df_merge['gps_missing'] = df_merge.hh_gpslat.isna()
print('Current missing amount: {:,}'.format(df_merge.gps_missing.sum()))
Current missing amount: 62,534
In [262]:
## wrap up
del off_street_names, street_more_suffix, to_impute, block_count, top, top_block, imputed
del geo_cache, i, street, match, base, pt, inOut, gps, outside

Verify Imputations are good

In [263]:
## Check to see if data was overwritten
## gps that weren't missing should not be imputed
## should be false
print('There are gps data accidently overwritten: {}'\
      .format(((df_merge.gps_missing_record != df_merge.imputed) != df_merge.gps_missing).any()))
There are gps data accidently overwritten: False
In [264]:
sfplot(df_merge)
In [265]:
## Obtain three random streets with atleast 500 non-imputed and imputed rows
tmp = df_merge.groupby('tick_street_name').agg({'imputed': 'sum', 'gps_missing_record': 'sum'})
tmp[(tmp.imputed > 500) & (tmp.gps_missing_record > 500)].sample(n = 3, random_state = 42)
Out[265]:
imputed gps_missing_record
tick_street_name
LINCOLN WAY 1031.0 1051.0
43RD 578.0 611.0
NATOMA 1859.0 1896.0
In [266]:
## Export for tableau visualization
# df_merge.loc[df_merge.tick_street_name.isin(['MISSION', 'LINCOLN WAY', '43RD', 'NATOMA']), 
#              ['tick_street_name', 'hh_gpslat', 'hh_gpslon', 'imputed']]\
#              .to_csv('dataset/tableau/df_merge_impute.csv')
In [267]:
## Display original data first
Image(filename = os.path.join('.','image','imputed_3_not.png'),width = 500, height = 100)
Out[267]:
In [268]:
## Manually visuallize whether impute looks fine
Image(filename = os.path.join('.','image','imputed_3.png'),width = 500, height = 100)
Out[268]:
In [269]:
## Manually visuallize whether impute looks fine
## Display original data first
Image(filename = os.path.join('.','image','imputed_mission.png'),width = 500, height = 100)
Out[269]:

Conclusion: After checking 3 random and the street with most tickets, I conclude that imputation worked just fine.

Wrap up and Export

In [270]:
## make sure there weren't any duplicates
print('N dupluciates? {}'\
      .format(df_merge.duplicated(subset='ticket_number', keep=False).sum()))
N dupluciates? 0
In [271]:
## drop columns that aren't usefull anymore
df_merge.drop(columns = ['_merge', 'tick_street_no', 
                         'tick_street_suffix', 'gps_missing', 
                         'street_partial', 'block', 
                         'violation_code', 'violation', 
                         'violation_desc_long'], inplace = True)
In [272]:
print(df_merge.shape)
## reorder columns in a more meaningful way
df_merge = df_merge.reindex(copy = False, 
                            columns = ['issue_datetime', 'ticket_number',
                                       'tick_street_name','address', 'hh_gpslat', 
                                       'hh_gpslon', 'violation_desc_general', 
                                       'violation_group','tick_corr_type', 
                                       'tick_corr_type_desc_long', 'handheld', 
                                       'badge', 'state_full', 'plate', 'tick_vin',
                                       'tick_rp_vin', 'gps_missing_record', 
                                       'imputed', 'veh_body'])
(3923063, 19)
In [273]:
df_merge.info(null_counts=True)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3923063 entries, 0 to 3923062
Data columns (total 19 columns):
issue_datetime              3923063 non-null datetime64[ns]
ticket_number               3923063 non-null object
tick_street_name            3923063 non-null object
address                     3923063 non-null object
hh_gpslat                   3860529 non-null float64
hh_gpslon                   3860529 non-null float64
violation_desc_general      3923063 non-null object
violation_group             3923063 non-null object
tick_corr_type              417896 non-null float64
tick_corr_type_desc_long    417896 non-null object
handheld                    3923063 non-null bool
badge                       3923063 non-null int16
state_full                  3923063 non-null object
plate                       3661904 non-null object
tick_vin                    3920824 non-null object
tick_rp_vin                 3551351 non-null object
gps_missing_record          3923063 non-null bool
imputed                     3923063 non-null bool
veh_body                    3897715 non-null object
dtypes: bool(3), datetime64[ns](1), float64(3), int16(1), object(11)
memory usage: 657.6+ MB
In [274]:
## Export for the team
df_merge.to_csv('dataset/df_merge_cleaned_sm.csv')
In [275]:
# ## export for me
df_merge.to_pickle('dataset/df_merge-sm.pickle', compression='gzip')
In [ ]:
## import backup
# df_merge = pd.read_pickle('dataset/df_merge.pickle')

Random

TODO List:

  • Need to look into tick_correction.
  • Do a random sampling of tick_vin and tick_rp_vin to see how many are actual mistyping error.
  • I should see if citations on a car without license plate, do they also get a citations for not having a license plate? Like double parking but found that it doesn't have license plate. I know for new car they have 60 days to register and get a license but is it possible to give a fix it ticket and if they don't have a license plate 60 days the ticket will be charged?
  • Specifically look for treasure island's ticket? like Ave B.

Questions:

  • Does ppl that have bad vin like cannot find, or read tend to do certain violations?
  • Do ppl with non-standard plate tend to do certain things.
  • What happens when a car is ticketed? Does it no longer get ticket for the rest of the day?
  • Is it possible for pda to check if a license plate is correct? More violation if found out it isn't
  • Figure out how many tickets are bogus
  • What is the minium information of a vehicle needed to ticket it?
  • ticket_number = 882030004 lLVESHOE SU 214 V5200 NO PLATES 248.0 TAYLOR ST NaN NaN NaN 0.000000 0.000000 both 2017-06-02 18:03:00 215 California. Is this a valid ticket?
  • 876288151 is HMMRTYME and got street cleaning but no information on the car even though the license plate is fake.

Random insight:

  • TRC7.2.101A FAILURE TO DISPLAY why is bus fare in this?
  • ticket number with 'G' and violations like GO1.I.6A are related to parking lot???
  • tick_vin should not be use to mark description like to high
  • officers has random habits of inputing vin like the last 4, random part of the vin etc. Also some misread the value like C when it was a 0. The length of the vin is random too. And there are people that X3060 1HGCD5635VA183060 put x in front of things.
  • What is the protocol when a car that is citated doesn't have a license plate and vin cannot be read? Does it get issued to the driver? what happens if the driver isn't there?
  • 854345741 864644620 person that the same time back to back. poor person.
  • Covering of vin is allowed.
  • TRC7.2.44A - Use of Lost/Stolen DP PL requires no interaction with the driver. 875. If it was wrong it is a fix it type ticket. This to TRC7.2.44C.
  • Off Street are not being ticketed very often maybe because they are privately own parking facility
  • TRC7.2.30A OT PRK DOWNTOWN TRC7.2.30B OT PRK OUT DOWNTOWN. Confused with this one. It is non-permit timed area. But also contain meters.